1. Introduction
In modern aerospace manufacturing, aero-engine components require extremely high standards of dimensional accuracy, structural reliability, and operational safety. Bearings, as one of the critical rotating components in aero-engine systems, play an essential role in supporting high-speed rotating shafts and ensuring stable power transmission [1]. In aerospace propulsion systems, accurate monitoring and diagnosis of bearing conditions are crucial for ensuring operational safety and preventing catastrophic failures. Recent studies have explored intelligent diagnostic methods based on deep learning to improve the robustness of bearing fault identification in complex operating environments [2]. Minor surface defects, such as cracks, scratches, burrs, pitting, and poor polishing, may originate from complex machining processes, assembly stresses, or harsh operating environments. Although these defects are usually small in size, they can act as stress concentration points, triggering early fatigue failure and significantly reducing the reliability of aero-engine components. Therefore, achieving high-accuracy, automated, and real-time detection of bearing surface defects is crucial for intelligent inspection and precision measurement in advanced manufacturing.
Traditional bearing surface defect detection methods mainly rely on manual inspection by human operators or conventional image processing-based algorithms [3], such as threshold segmentation, edge detection, and texture analysis. These approaches are relatively effective when dealing with homogeneous backgrounds or single-type defects; however, their capability to identify defects under complex backgrounds, low-contrast conditions, or multi-scale scenarios remains limited. Moreover, such traditional methods generally exhibit poor robustness and are highly sensitive to variations in illumination, noise, and texture interference, making them inadequate for the diverse and high-speed inspection requirements of modern industrial production [4]. With the continuous progress of deep learning and computer vision, surface defect detection has increasingly relied on CNN (Convolutional Neural Network)-based object detection methods, which benefit from unified feature extraction and prediction within a single learning framework [5]. For instance, Xia et al. [6] proposed an improved Faster R-CNN-based surface defect detection algorithm, in which a feature pyramid network built upon a ResNet-50 backbone with deformable convolutions was employed to enhance the representation of multi-scale defect features, thereby addressing challenges arising from diverse defect types and complex geometries. Zhang et al. [7] introduced an improved YOLOv5 model that integrates a multi-scale feature fusion strategy with a CSPLayer Res2Attention residual module, significantly strengthening defect feature extraction and aggregation, and consequently improving classification and localization accuracy. In aerospace applications, intelligent diagnosis frameworks have also been proposed to analyze bearing health conditions under complex working environments. For example, federated learning combined with self-attention mechanisms has been explored to improve the accuracy and robustness of aerospace bearing fault diagnosis [8]. Among these approaches, the YOLO (You Only Look Once) series models, characterized by lightweight architectures, high detection speed, and competitive accuracy, have been widely adopted in industrial visual inspection tasks [9]. These intelligent visual inspection methods provide an important technical foundation for automated defect measurement and quality control in modern aerospace manufacturing.
Despite the remarkable success of the YOLO series in general object detection, several challenges remain when applying these models to bearing surface defect detection. First, bearing defects are typically characterized by small scales, irregular shapes, and low contrast [10], making it difficult for conventional convolutional structures to effectively capture fine-grained edge and texture information. Second, although lightweight variants such as YOLOv8n offer high inference speed, the fusion between deep semantic features and shallow texture features is often insufficient [11], limiting detection performance in complex multi-scale scenarios. Furthermore, the lack of targeted enhancement mechanisms for critical features in deep networks makes them vulnerable to background interference in complex industrial environments, thereby reducing defect discriminability [12]. In addition, conventional downsampling strategies tend to cause the loss of subtle defect information during feature map compression [13], which is detrimental to subsequent multi-scale feature fusion and precise localization. Consequently, further improving detection accuracy and model robustness while maintaining real-time inference speed [14] remains an urgent challenge.
In response to the above limitations, an improved lightweight bearing surface defect detection model, named DMR-YOLO, was developed in this study. The proposed model is built on YOLOv8n and incorporates multi-level structural optimizations. Specifically, the main contributions of this work are summarized as follows:
(1) A task-oriented re-parameterized feature extraction module (C2f-DBB) is designed by embedding the Diverse Branch Block into the YOLOv8 backbone and neck. Unlike conventional static convolution structures, the proposed design enables adaptive multi-branch feature aggregation during training while maintaining single-path efficiency during inference. This improves the representation of irregular and small-scale defect patterns under complex industrial backgrounds.
(2) A lightweight multi-level channel attention module (C2f-MLCA) is constructed and integrated into selected backbone layers. Different from conventional channel attention mechanisms, the proposed design jointly models local and global channel dependencies, specifically enhancing the discrimination of low-contrast defect features against noisy metallic textures.
(3) A ResidualADown module is proposed to address information loss in conventional downsampling. By introducing a residual information preservation path, the module improves the retention of fine-grained spatial details, which is critical for detecting subtle defects such as micro-cracks.
(4) More importantly, a collaborative optimization strategy is developed by integrating DBB, MLCA, and ResidualADown into a unified lightweight framework. The three modules are designed to complement each other in feature extraction, feature selection, and information preservation, respectively, forming a synergistic mechanism that enhances detection performance beyond simple incremental improvements.
Through these improvements, the proposed DMR-YOLO model maintains a lightweight design and high inference speed while enhancing multi-scale feature representation and small-target detection performance. Experimental results show that DMR-YOLO achieves higher detection accuracy and stability than existing lightweight models on the bearing surface defect dataset. These results indicate its potential for intelligent visual inspection in surface defect detection tasks, including applications in aero-engine component inspection. In aerospace manufacturing, defect detection is not only required to identify defect categories, but also to support reliability-oriented inspection processes. In practical scenarios, automated visual inspection systems are typically used as a preliminary screening tool to assist human experts, where high recall is critical to avoid missing potential defects, while real-time performance is required for production efficiency. Therefore, developing a lightweight and high-sensitivity detection model is an important step toward intelligent inspection in aerospace manufacturing pipelines, although it does not replace subsequent quantitative evaluation and certification procedures.
The remainder of this paper is organized as follows:
Section 2 briefly reviews the research status of bearing defect detection technologies and introduces the fundamental concepts of the YOLOv8n model.
Section 3 presents the overall architecture of the proposed DMR-YOLO model and elaborates on the key improvement strategies.
Section 4 describes the experimental setup and provides a detailed analysis of the experimental results to validate the performance of the proposed model.
Section 5 conducts generalization experiments of the improved model on additional datasets. Finally, the conclusions are drawn, and potential directions for future research are discussed.
3. Improvements of YOLOv8 Algorithm
The architectural design of DMR-YOLO is fundamentally driven by the unique challenges of industrial bearing defect detection, such as low contrast, fine-grained features (e.g., micro-cracks), and the demand for real-time edge deployment. The improved YOLOv8n algorithm is illustrated in Figure 2. In the backbone network, the C2f-DBB module replaces the original C2f modules at the 2nd and 4th layers, while in the neck network, it is applied at the 12th, 15th, 18th, and 21st layers. This choice stems from the need to capture irregular defect geometries; by incorporating multiple branch convolutions during training and fusing them via structural re-parameterization for inference, the C2f-DBB module enhances feature diversity and multi-scale perception capability without increasing the computational burden during deployment.
Additionally, a Multi-Level Channel Attention (MLCA) mechanism is introduced at the 6th and 8th layers of the backbone network. The original C2f modules at these layers are replaced with C2f-MLCA, which adaptively adjusts feature weights across multiple channel levels. This is specifically designed to handle the complex optical conditions of metallic bearing surfaces; by recalibrating channel-wise importance, the mechanism strengthens the model’s focus on critical defect signatures while effectively suppressing background noise caused by specular reflections and uneven lighting.
Finally, the specially designed ResidualADown module is integrated into the 3rd, 5th, and 7th layers. Unlike standard downsampling that often acts as an information bottleneck, this module reduces spatial resolution while maximally preserving important feature information through a residual bypass. This ensures that pixel-level spatial details of subtle defects are not discarded during the deepening of the network, thereby mitigating information loss during downsampling.
It is important to emphasize that the proposed modules are not simply combined in a parallel or independent manner. Instead, they are designed to address different but interrelated challenges in industrial defect detection.
Specifically, the C2f-DBB module enhances the diversity and adaptability of feature extraction, enabling better representation of irregular defect patterns. The C2f-MLCA module further refines these features by selectively emphasizing informative channels and suppressing background noise. Meanwhile, the ResidualADown module preserves critical spatial information during downsampling, ensuring that subtle defect details are not lost in deeper layers. Through this complementary design, the three modules form a synergistic pipeline of “feature enhancement—feature selection—information preservation,” which leads to more robust detection performance compared with isolated or naively combined improvements.
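The layer placement described above can be summarized as a small lookup. The sketch below is only an illustration: the baseline list follows the standard YOLOv8 layer ordering, and the names `YOLOV8_LAYERS`, `REPLACEMENTS`, and `build_dmr_layout` are hypothetical helpers introduced here, not part of the original implementation.

```python
# Baseline YOLOv8 layer types, indexed as in the standard model definition.
YOLOV8_LAYERS = [
    "Conv", "Conv", "C2f", "Conv", "C2f",      # layers 0-4  (backbone)
    "Conv", "C2f", "Conv", "C2f", "SPPF",      # layers 5-9  (backbone)
    "Upsample", "Concat", "C2f",               # layers 10-12 (neck)
    "Upsample", "Concat", "C2f",               # layers 13-15 (neck)
    "Conv", "Concat", "C2f",                   # layers 16-18 (neck)
    "Conv", "Concat", "C2f",                   # layers 19-21 (neck)
    "Detect",                                  # layer 22 (head)
]

# Layer indices replaced by each proposed module, as stated in the text.
REPLACEMENTS = {
    "C2f-DBB": [2, 4, 12, 15, 18, 21],
    "C2f-MLCA": [6, 8],
    "ResidualADown": [3, 5, 7],
}


def build_dmr_layout(base=YOLOV8_LAYERS, replacements=REPLACEMENTS):
    """Return a copy of the baseline layout with the listed layers swapped."""
    layout = list(base)
    for module, indices in replacements.items():
        for idx in indices:
            layout[idx] = module
    return layout


if __name__ == "__main__":
    for idx, name in enumerate(build_dmr_layout()):
        print(idx, name)
```

Under this reading, the stride-2 convolutions at layers 16 and 19 of the neck are left unchanged, since the text only lists layers 3, 5, and 7 for ResidualADown.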
3.1. Re-Parameterization Module C2f-DBB
In the original YOLOv8n, feature extraction and fusion are performed using the C2f module. In this module, the convolutional kernel parameters are fixed after training, and all branch convolutions follow a static structure, which limits the diversity of feature representation. However, bearing surface defects are typically extremely small, with weak and highly variable visual characteristics. The fixed kernel sizes and shapes in the standard C2f module are insufficient to effectively capture such fine-grained features, leading to suboptimal detection performance in complex backgrounds. To address this limitation, a diverse branch module based on multi-branch convolution and structural re-parameterization, termed C2f-DBB, is introduced to replace the C2f modules in YOLOv8n.
In C2f-DBB, the standard convolution operations within the Bottleneck structure are replaced with a multi-branch convolutional design. By incorporating parallel branches with heterogeneous convolution kernels, the module enhances the diversity of feature extraction at multiple receptive fields. This design enables more effective representation of small, irregular, and multi-scale defect patterns. Meanwhile, the multi-branch structure is equivalently fused into a single convolution during inference via structural re-parameterization, ensuring that no additional computational overhead is introduced. As a result, the proposed module improves feature representation capability while maintaining high inference efficiency, making it suitable for complex industrial defect detection scenarios. The structure of C2f-DBB is illustrated in Figure 3.
The DBB (Diverse Branch Block) module is a convolutional building block based on structural re-parameterization, which enhances feature representation diversity by introducing a multi-branch topology during training while maintaining inference efficiency. The overall structure of the DBB module is illustrated in Figure 4. Let the input feature map be denoted as:
$$X \in \mathbb{R}^{C \times H \times W}$$
The DBB module consists of four parallel branches, and the final output feature map is obtained by element-wise summation of all branch outputs. This operation can be formulated as follows:
$$Y = \sum_{i=1}^{4} F_i(X)$$
where $F_i(\cdot)$ denotes the feature transformation function of the $i$-th branch.
Branches 1 and 4 are single-layer convolution branches, which adopt 1 × 1 convolution and K × K convolution, respectively, followed by Batch Normalization (BN). These two branches introduce linear and spatial feature transformations with different receptive fields, enhancing feature diversity and stabilizing feature distributions. This operation can be formulated as follows:
$$F_1(X) = \mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X)), \quad F_4(X) = \mathrm{BN}(\mathrm{Conv}_{K \times K}(X))$$
Branch 2 first applies a 1 × 1 convolution to compress and reorganize channel information, followed by a K × K convolution to capture local spatial features. Batch Normalization is applied after each convolution to improve training stability. This operation can be formulated as follows:
$$F_2(X) = \mathrm{BN}(\mathrm{Conv}_{K \times K}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X))))$$
This branch focuses on extracting fine-grained spatial details and local texture information, which is particularly beneficial for detecting small and irregular defect patterns.
Branch 3 first employs a 1 × 1 convolution followed by BN to adjust channel representations and then applies Average Pooling to aggregate neighborhood information and enhance contextual robustness. This operation can be formulated as follows:
$$F_3(X) = \mathrm{AvgPool}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X)))$$
This branch introduces a smoothing effect and strengthens global contextual perception, improving robustness against background noise.
By integrating convolutional features from branches with heterogeneous receptive fields and modeling behaviors, the DBB module enhances the network’s ability to capture fine-grained details while preserving broader contextual representations. The outputs of all four branches are then integrated via element-wise summation to form the unified output feature map. This operation can be formulated as follows:
$$Y = F_1(X) + F_2(X) + F_3(X) + F_4(X)$$
During inference, the DBB module can be equivalently folded into a single convolution through structural re-parameterization, such that:
$$Y = \mathrm{Conv}_{\mathrm{fused}}(X)$$
where the convolution kernel weights and biases of $\mathrm{Conv}_{\mathrm{fused}}$ are obtained by linearly combining the convolution and BN parameters from all branches. This re-parameterization is applied only during inference, fully retaining the performance benefits of the multi-branch structure during training without introducing additional inference overhead.
The integration of the DBB module is specifically motivated by the inherent trade-off between detection precision and inference latency in industrial scenarios. Given the diverse and irregular geometries of bearing surface defects (e.g., irregular burrs and fine-grained cracks), a robust multi-scale receptive field is essential. By employing heterogeneous convolution kernels in a multi-branch structure during the training phase, DBB significantly enhances the model’s capacity for complex feature extraction. Subsequently, these branches are fused into a single-path equivalent convolution via structural re-parameterization for inference. This design ensures that DMR-YOLO achieves superior feature diversity without imposing additional computational overhead on edge-constrained devices.
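The folding step can be checked numerically on a toy case. The following is a minimal sketch in plain Python, assuming a single channel, zero padding, and only the two single-layer branches (1 × 1 and K × K with K = 3); the sequential and average-pooling branches need further transforms that are not shown. All function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def conv2d_same(image, kernel, bias=0.0):
    """Single-channel 2D cross-correlation with zero padding (odd square kernel)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def batch_norm(feature, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN using fixed running statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [[(v - mean) * scale + beta for v in row] for row in feature]


def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into an equivalent kernel and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [[wgt * scale for wgt in row] for row in kernel], beta - mean * scale


def pad_kernel(kernel, size):
    """Zero-pad a smaller odd kernel to size x size, keeping it centred."""
    off = (size - len(kernel)) // 2
    padded = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(kernel):
        for j, wgt in enumerate(row):
            padded[i + off][j + off] = wgt
    return padded


def merge_branches(branches, size):
    """Sum fused (kernel, bias) pairs into one size x size convolution."""
    kernel = [[0.0] * size for _ in range(size)]
    bias = 0.0
    for branch_kernel, branch_bias in branches:
        padded = pad_kernel(branch_kernel, size)
        for i in range(size):
            for j in range(size):
                kernel[i][j] += padded[i][j]
        bias += branch_bias
    return kernel, bias


# Illustrative values only (not taken from a trained model).
image = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, 0.2], [-0.1, 0.0, 0.1]]
k1 = [[0.7]]
bn3 = dict(gamma=1.2, beta=0.1, mean=0.3, var=0.8)
bn1 = dict(gamma=0.9, beta=-0.2, mean=0.1, var=1.5)

# Training-time form: two Conv+BN branches summed element-wise.
y3 = batch_norm(conv2d_same(image, k3), **bn3)
y1 = batch_norm(conv2d_same(image, k1), **bn1)
multi_branch = [[a + b for a, b in zip(r3, r1)] for r3, r1 in zip(y3, y1)]

# Inference-time form: a single fused 3 x 3 convolution.
fused_kernel, fused_bias = merge_branches(
    [fuse_conv_bn(k3, **bn3), fuse_conv_bn(k1, **bn1)], size=3
)
single_path = conv2d_same(image, fused_kernel, fused_bias)
```

In this toy setting `multi_branch` and `single_path` agree up to floating-point rounding, which is the property that lets the training-time branches be discarded at deployment.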
3.2. Attention Mechanism C2f-MLCA
In bearing surface defect detection tasks, defect regions usually occupy only a small portion of the image, while background textures may dominate the feature responses. This often leads to insufficient discrimination between defect and non-defect regions. To address this issue, a multi-level channel attention mechanism is introduced to enhance informative channel features and suppress irrelevant background responses. The C2f-MLCA module is a lightweight feature enhancement structure derived from the C2f module, following a “split–enhance–fuse” workflow, as illustrated in Figure 5. The input feature map first passes through a CBS module for basic convolutional preprocessing. It is then split along the channel dimension via a Split operation: one portion of the features is retained as a shortcut, while the remaining portion is fed into a feature enhancement branch. The enhancement branch consists of repeated units of Bottleneck + MLCA, where the Bottleneck unit uses a lightweight residual structure to flexibly adjust channel dimensions, and the Multi-Level Channel Attention (MLCA) module integrates both local and global feature perception mechanisms. This allows the network to capture spatial details and channel dependencies simultaneously, achieving targeted enhancement of critical features. Finally, the processed branch features are concatenated with the shortcut features along the channel dimension through a Concat operation, and the output CBS module unifies the channel dimensions, completing the integration of multi-scale and multi-dimensional information. The resulting enhanced feature map is then output.
The MLCA module combines channel and spatial feature information to improve the representational capability of the network, as shown in Figure 5. First, the input feature map undergoes local average pooling, producing a tensor of shape 1 × C × K × K to extract local spatial features. The tensor is then split into two parallel branches: one branch captures global contextual information through global average pooling, while the other captures fine-grained spatial information via local average pooling. Subsequently, features from both branches are transformed through 1D convolutions and upsampled to the original spatial resolution using reverse average pooling. The fused local and global channel attention weights are then generated according to the following formula:
Here, $\sigma$ denotes the Sigmoid activation function, and $\mathrm{Conv1d}$ represents a one-dimensional convolutional transformation.
The attention feature maps from the two branches are then fused in an element-wise manner, producing the final feature map that integrates both local and global information. This feature map assigns multi-scale attention weights along both the channel and spatial dimensions, thereby significantly enhancing the representational capability of the network.
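To make the local and global pooling path more concrete, a simplified single-channel sketch is given below. It is not the MLCA implementation: the 1D convolutions across channels are omitted, and the equal-weight mixing of local and global responses (`local_weight=0.5`) is an assumption made here for illustration. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_avg_pool(plane, k):
    """Average an H x W plane down to a k x k grid (H and W divisible by k)."""
    h, w = len(plane), len(plane[0])
    bh, bw = h // k, w // k
    return [
        [
            sum(plane[i * bh + di][j * bw + dj] for di in range(bh) for dj in range(bw))
            / (bh * bw)
            for j in range(k)
        ]
        for i in range(k)
    ]


def global_avg_pool(grid):
    """Global average computed from the locally pooled grid."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


def unpool(grid, h, w):
    """Reverse average pooling: repeat each grid cell over its original block."""
    k = len(grid)
    bh, bw = h // k, w // k
    return [[grid[i // bh][j // bw] for j in range(w)] for i in range(h)]


def mlca_like_attention(plane, k=2, local_weight=0.5):
    """Reweight a plane with mixed local/global attention (simplified, assumed mixing)."""
    h, w = len(plane), len(plane[0])
    local = local_avg_pool(plane, k)
    global_value = global_avg_pool(local)
    mixed = [
        [
            local_weight * sigmoid(v) + (1.0 - local_weight) * sigmoid(global_value)
            for v in row
        ]
        for row in local
    ]
    weights = unpool(mixed, h, w)
    return [[plane[i][j] * weights[i][j] for j in range(w)] for i in range(h)]
```

Because every attention weight lies strictly between 0 and 1, the output is a damped copy of the input in which regions with stronger local responses are attenuated less.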
In Figure 5, $\mathrm{Conv1d}$ denotes a one-dimensional convolution, where the kernel size $k$ is determined adaptively by the number of channels $C$, and their relationship is defined as follows:
$$k = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$$
Here, $C$ denotes the number of channels, $k$ is the convolutional kernel size, and $\gamma$ and $b$ are hyperparameters, both set to 2 by default. The subscript odd indicates that $k$ is rounded to the nearest odd number; if the result is even, 1 is added to ensure an odd kernel size.
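The kernel-size rule can be evaluated directly. The sketch below assumes the ECA-style form $k = |\log_2 C / \gamma + b / \gamma|_{\mathrm{odd}}$ with both hyperparameters defaulting to 2 as stated above, and it assumes the value is truncated to an integer before the parity check (a common implementation choice, not confirmed by the text). The function name `mlca_kernel_size` is hypothetical.

```python
import math


def mlca_kernel_size(channels, gamma=2, b=2):
    """Adaptive 1D kernel size: truncate, then bump even values to the next odd."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


if __name__ == "__main__":
    for c in (16, 64, 128, 256, 1024):
        print(c, mlca_kernel_size(c))
```

With these defaults the kernel grows slowly with channel count, so wide layers see only a modest increase in the cross-channel interaction range.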
In this work, the MLCA algorithm is integrated into the original C2f module to form the C2f-MLCA module. This integration enhances the network’s ability to perceive both channel and spatial information without significantly increasing computational complexity. The introduction of this module effectively improves feature extraction capability and strengthens the network’s performance in detecting targets of varying scales.
3.3. ResidualADown Module
Downsampling operations in convolutional neural networks often lead to the loss of important feature information. To address this issue, a ResidualADown module is designed to improve feature propagation during the downsampling process. By introducing a residual connection, the proposed module can effectively preserve useful feature information while reducing spatial resolution, thereby enhancing the representation capability of the network, as illustrated in Figure 6. Let the input feature map be defined as:
$$X \in \mathbb{R}^{C \times H \times W}$$
The input feature map is first processed by an AvgPool2d layer for preliminary downsampling, which reduces spatial resolution while preserving global contextual information. This process can be expressed as:
$$X_{\mathrm{avg}} = \mathrm{AvgPool2d}(X)$$
Subsequently, the downsampled feature map is divided into two branches along the channel dimension via a Chunk operation:
$$[X_1, X_2] = \mathrm{Chunk}(X_{\mathrm{avg}})$$
In Branch 1, the features are directly fed into a convolutional layer to extract local detail information, which can be formulated as:
$$Y_1 = \mathrm{Conv}(X_1)$$
Branch 2 first applies a MaxPool2d layer to further compress the spatial dimensions and emphasize salient response regions, followed by a convolutional transformation. The computation is given by:
$$Y_2 = \mathrm{Conv}(\mathrm{MaxPool2d}(X_2))$$
Finally, the outputs of the two branches are concatenated through a Concat operation to obtain the downsampled feature map with fused multi-scale information:
$$Y_{\mathrm{main}} = \mathrm{Concat}(Y_1, Y_2)$$
Although the above dual-branch structure enables effective multi-scale feature extraction during downsampling, a certain degree of information loss may still occur during progressive feature transformation. To further alleviate this issue, a residual path is introduced to enhance feature propagation and preserve complementary information from the input feature map, as illustrated in Figure 6 (structure of the ResidualADown module).
Specifically, the ResidualADown module adopts a dual-path parallel architecture, consisting of a main path and a residual path. The main path performs the downsampling and feature extraction operations described above, while the residual path directly conveys part of the original feature information to the output.
In the residual path, the input feature map is first processed by an AvgPool2d layer to perform spatial downsampling, followed by a convolutional layer to adjust the channel dimension, ensuring that the output feature map is consistent with the main path in both spatial resolution and channel size. The residual path can be expressed as:
$$Y_{\mathrm{res}} = \mathrm{Conv}(\mathrm{AvgPool2d}(X))$$
Subsequently, the output of the residual path is combined with the main path output via element-wise addition to achieve feature fusion. The overall output of the ResidualADown module is given by:
$$Y = Y_{\mathrm{main}} + Y_{\mathrm{res}}$$
where $Y_{\mathrm{main}}$ denotes the output of the main path and $Y_{\mathrm{res}}$ represents the output of the residual path. By employing lightweight pooling and convolution operations, the residual path ensures effective feature alignment and fusion in both spatial and channel dimensions. This residual structure can be regarded as introducing an identity-preserving mechanism during the downsampling stage, which is consistent with the residual learning philosophy of ResNet. It helps mitigate feature degradation and information loss in deep networks. ResidualADown addresses the information bottleneck problem in traditional downsampling by introducing a residual shortcut that preserves high-frequency spatial details (e.g., pixel-level cracks) typically discarded by standard strided convolutions. By combining the multi-scale fused features from the main path with the original feature representations retained in the residual path, the module enhances the stability and integrity of feature representations while maintaining lightweight characteristics, thereby providing a more informative foundation for subsequent high-level semantic modeling and improving the detection of subtle manufacturing defects.
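Element-wise addition requires the main and residual paths to produce tensors of identical shape. The sketch below checks this with simple shape arithmetic. The kernel sizes, strides, and paddings are assumptions borrowed from the ADown block of YOLOv9 for the main path and a 2 × 2, stride-2 average pooling plus 1 × 1 convolution for the residual path; the text does not specify these values, and the function names are hypothetical.

```python
def out_size(size, kernel, stride, padding=0):
    """Output length of a pooling or convolution layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def residual_adown_shapes(c_out, h, w):
    """Return (main_shape, residual_shape) as (channels, height, width) tuples."""
    # Main path, step 1: AvgPool2d(kernel=2, stride=1) -- assumed values.
    h_avg, w_avg = out_size(h, 2, 1), out_size(w, 2, 1)

    # Branch 1: 3x3 convolution, stride 2, padding 1, producing c_out // 2 channels.
    b1 = (c_out // 2, out_size(h_avg, 3, 2, 1), out_size(w_avg, 3, 2, 1))

    # Branch 2: MaxPool2d(kernel=3, stride=2, padding=1), then a 1x1 convolution.
    h_max, w_max = out_size(h_avg, 3, 2, 1), out_size(w_avg, 3, 2, 1)
    b2 = (c_out // 2, out_size(h_max, 1, 1), out_size(w_max, 1, 1))

    # Concat along channels requires equal spatial sizes in both branches.
    assert b1[1:] == b2[1:], "branch outputs must share spatial size"
    main = (b1[0] + b2[0], b1[1], b1[2])

    # Residual path: AvgPool2d(kernel=2, stride=2), then a 1x1 convolution to c_out.
    residual = (c_out, out_size(h, 2, 2), out_size(w, 2, 2))
    return main, residual


if __name__ == "__main__":
    print(residual_adown_shapes(c_out=64, h=160, w=160))
```

Under these assumptions both paths halve the spatial resolution and emit `c_out` channels (for even `c_out`), so the final addition is well defined.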
4. Results and Evaluation
4.1. Dataset
In this study, a bearing casting surface condition dataset, legally obtained from a commercial third-party source, was used to simulate surface defect inspection scenarios in industrial component manufacturing. The dataset is not publicly available and is used exclusively for academic research purposes. All defect categories in the dataset are predefined and manually annotated based on their visual characteristics, rather than explicit geometric or physical measurements such as size or depth. All images in the dataset are annotated using bounding boxes that enclose defect regions, following standard object detection labeling practices. Although the dataset used in this study is collected from industrial bearing manufacturing scenarios rather than directly from aerospace production lines, the defect types (e.g., cracks, pits, and scratches) and surface characteristics are highly relevant to those encountered in aero-engine component inspection. Therefore, the dataset is used as a representative benchmark to simulate visual inspection tasks in aerospace manufacturing environments. Bearings are widely used as critical rotating components in aero-engine systems, and their surface quality plays an important role in ensuring operational reliability. The dataset covers a variety of surface conditions that may occur during the production and machining processes of bearing castings, including eight defect categories: Casting burr, Polished casting, Burr, Crack, Pit, Scratch, Strain, and Unpolished casting. The dataset consists of 2561 images in the training set and 732 images in the validation set, with an image resolution of 640 × 640 pixels. The training-to-validation split ratio is therefore approximately 3.5:1 (about 78% training and 22% validation).
In terms of class distribution, the number of samples in each defect category is generally balanced, although slight variations exist due to differences in the frequency of defect occurrence in real manufacturing processes. This reflects practical industrial conditions and avoids excessive bias toward specific defect types.
Due to data usage restrictions imposed by the provider, the dataset cannot be publicly released. However, all experimental configurations, training procedures, and evaluation metrics are fully described in this study to ensure the reproducibility of the proposed method. The model can be readily applied to other similar surface defect datasets. Representative image samples from the dataset are shown in Figure 7.
4.2. Implementation Details
All experiments were conducted on a Windows 11 operating system. The hardware platform is equipped with an Intel Core i7-14650HX CPU and an NVIDIA GeForce RTX 5060 GPU with 16 GB of video memory. Detailed hardware configurations are listed in Table 1 (Experimental Environment). During training, the number of training epochs was set to 200, with a batch size of 16. The initial learning rate was 0.01, and the momentum parameter was set to 0.937. The stochastic gradient descent (SGD) optimizer was employed for model optimization. The detailed training hyperparameters are summarized in Table 2 (Experimental Parameters). All models in the comparison experiments are trained under identical experimental settings, including the same training and validation splits, data preprocessing, training epochs, optimization parameters, and hardware environment, to ensure a fair comparison. To reduce the influence of randomness in model training and ensure the reliability of the experimental results, each experiment in this study was independently repeated three times, and the final reported results are the average values of the three runs.
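For reference, the stated settings can be collected in one place together with a helper for averaging repeated runs. This is a sketch only: the dictionary keys and the `summarize_runs` helper are hypothetical names, the image size is taken from the dataset description, and the standard deviation is an addition made here (the text reports only the mean of three runs).

```python
from statistics import mean, stdev

# Training settings as stated in the text; key names are illustrative.
TRAIN_CONFIG = {
    "epochs": 200,
    "batch_size": 16,
    "initial_lr": 0.01,
    "momentum": 0.937,
    "optimizer": "SGD",
    "image_size": 640,   # from the 640 x 640 dataset resolution
    "repeats": 3,        # independent runs per experiment
}


def summarize_runs(values, expected_runs=TRAIN_CONFIG["repeats"]):
    """Return (mean, sample standard deviation) of a metric over repeated runs."""
    if len(values) != expected_runs:
        raise ValueError(f"expected {expected_runs} runs, got {len(values)}")
    return mean(values), stdev(values)
```

Reporting the spread alongside the mean makes it easier to judge whether a gain of about one percentage point exceeds run-to-run variation.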
4.3. Evaluation Indicators
To accurately evaluate the detection accuracy and efficiency of the proposed improved algorithm for bearing surface defects, Precision (P), Recall (R), and mean Average Precision (mAP) are adopted as evaluation metrics, which are used to assess category-level detection performance. Misclassification cases are counted as incorrect detections and are reflected in the evaluation metrics. Their mathematical definitions are given as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where TP denotes the number of samples in which defect regions are correctly detected, FP represents the number of samples that are incorrectly detected as defects when no defect is present, and FN indicates the number of defect samples that fail to be detected.
Since bearing surface defect detection involves multiple defect categories, Precision (P) and Recall (R) alone are insufficient to comprehensively evaluate the performance of the model. During the training phase, a corresponding Precision–Recall (PR) curve can be generated for each defect category, and the area under the PR curve represents the Average Precision (AP), which is defined as:
$$AP = \int_{0}^{1} P(R)\, dR$$
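The mean Average Precision (mAP) denotes the average of the AP values across all categories and is calculated as:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of defect categories and $AP_i$ is the Average Precision of the $i$-th category.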
In this study, the term “detection accuracy” refers to the overall detection performance of the model, which is comprehensively evaluated using Precision (P), Recall (R), and mean Average Precision (mAP). Among these metrics, mAP is considered the primary indicator, as it reflects both localization and classification performance across all defect categories.
In industrial defect detection scenarios, the definition of satisfactory detection accuracy is typically task-dependent and varies with dataset characteristics, defect complexity, and application requirements. In general, a model is considered practically effective when it achieves a high mAP while maintaining strong recall to minimize missed detections.
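The metrics above can be computed as in the following minimal sketch. The AP routine uses all-point interpolation of the PR curve, which is one common convention and an assumption here, since the text does not state which interpolation was used. Function names are hypothetical.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth defects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the PR curve using all-point interpolation."""
    points = sorted(zip(recalls, precisions))
    r = [0.0] + [pt[0] for pt in points] + [1.0]
    p = [0.0] + [pt[1] for pt in points] + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_values):
    """mAP = mean of per-category AP values."""
    return sum(ap_values) / len(ap_values)
```

In a screening setting where missed defects are costly, recall would typically be inspected per category in addition to the aggregate mAP.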
4.4. Ablation Study
To evaluate the effectiveness of the proposed DMR-YOLO compared with the baseline YOLOv8n model and to verify the contribution of each individual improvement, a series of ablation experiments were conducted. All experiments were performed under the same experimental environment with identical hyperparameter settings to ensure fair comparison. Starting from the baseline YOLOv8n model, the proposed improvement modules were gradually introduced one by one. The detailed ablation results are presented in Table 3 (Comparison of Ablation Experiment Results).
Experiment 1 serves as the baseline model, achieving an mAP of 86.5%, which provides a clear reference framework for evaluating the performance gains of subsequent modules. In Experiment 2, the original C2f modules in the baseline model are replaced with the proposed C2f-DBB modules, resulting in an increase in mAP to 87.5%. Meanwhile, Precision and Recall remain nearly unchanged, with only a slight increase in the number of parameters and computational cost. This indicates that the DBB module effectively enhances feature representation capability at an acceptable computational overhead, contributing steadily to detection accuracy. In addition, a slight improvement in FPS is observed. This can be attributed to the structural re-parameterization mechanism of the DBB module, where the multi-branch structure used during training is equivalently fused into a single convolution during inference, leading to more efficient execution. In Experiment 3, the C2f-MLCA modules are introduced at the 6th and 8th layers of the backbone network to replace the original C2f modules. The results show that both detection accuracy and computational cost remain comparable to those of the baseline model, suggesting that MLCA primarily functions in feature distribution adjustment and background noise suppression rather than significantly strengthening deep semantic representations when used independently. Consequently, the performance gain of the MLCA module alone is relatively limited. Experiment 4 incorporates the proposed ResidualADown module, leading to a notable improvement in mAP to 88.4%. At the same time, both Precision and Recall are enhanced, while the overall computational cost is reduced. These results demonstrate that ResidualADown achieves an efficient and stable improvement in detection performance, making it the most effective single-module enhancement among the evaluated components.
On this basis, Experiments 5–7 investigate different combinations of the proposed improvement strategies. In Experiment 5, the integration of DBB and MLCA results in an mAP of 88.2%, outperforming the corresponding single-module experiments. However, this performance gain is accompanied by an increase in model parameters and computational complexity, leading to a slight reduction in FPS. This indicates that although multi-strategy fusion can improve detection performance, it inevitably introduces additional computational overhead. In Experiments 6 and 7, the ResidualADown module is incorporated while maintaining nearly stable Precision and Recall values, resulting in varying degrees of reduction in parameter scale and computational complexity. These results demonstrate the complementary relationship among different modules in terms of balancing performance and efficiency.
In Experiment 8, the complete architecture integrating all three proposed modules (C2f-DBB, C2f-MLCA, and ResidualADown) is adopted to construct the final model, termed DMR-YOLO. As shown in Table 3, the proposed model achieves an mAP of 89.3%, which is the best performance among all experiments. The results indicate a clear synergistic effect among the three components. Notably, while the MLCA module provides limited performance improvement when applied individually (Exp. 3), its contribution becomes significant when combined with ResidualADown and DBB (Exp. 8), resulting in a substantial increase in detection accuracy. This demonstrates that DMR-YOLO operates as an integrated system rather than a simple combination of independent modules, as the interaction between feature enhancement (DBB), feature recalibration (MLCA), and information preservation (ResidualADown) leads to a cooperative optimization process. Specifically, ResidualADown preserves subtle texture and defect details during the downsampling process, DBB strengthens local feature representation, and MLCA performs cross-scale attention modulation to filter critical signals from complex metallic background noise. Through their complementary roles in deep semantic enhancement, cross-scale feature aggregation, and feature selection, the three modules collectively improve the model’s robustness to low-contrast and small-scale defects. Moreover, the final model maintains high inference speed while achieving superior detection accuracy, demonstrating the effectiveness and practical applicability of DMR-YOLO in real-world industrial scenarios.
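The gains discussed above can be recomputed directly from the mAP values quoted in the text. The short sketch below does so, distinguishing the absolute gain in percentage points from the relative gain with respect to the baseline. Only the experiments whose mAP is stated explicitly in this section are included; the dictionary keys and function names are hypothetical labels introduced for illustration.

```python
# mAP values (%) as quoted in the ablation discussion; other experiments
# are omitted because their exact values are not stated in the text.
reported_map = {
    "Exp. 1 (YOLOv8n baseline)": 86.5,
    "Exp. 2 (+C2f-DBB)": 87.5,
    "Exp. 4 (+ResidualADown)": 88.4,
    "Exp. 5 (+C2f-DBB, +C2f-MLCA)": 88.2,
    "Exp. 8 (DMR-YOLO)": 89.3,
}
BASELINE = "Exp. 1 (YOLOv8n baseline)"


def gains_in_points(results, baseline_key):
    """Absolute mAP gain over the baseline, in percentage points."""
    base = results[baseline_key]
    return {name: round(value - base, 1)
            for name, value in results.items() if name != baseline_key}


def relative_gain_percent(results, baseline_key, key):
    """Relative mAP gain over the baseline, in percent of the baseline value."""
    base = results[baseline_key]
    return round((results[key] - base) / base * 100, 2)


point_gains = gains_in_points(reported_map, BASELINE)
final_relative = relative_gain_percent(reported_map, BASELINE, "Exp. 8 (DMR-YOLO)")

for name, gain in point_gains.items():
    print(f"{name}: +{gain} points")
print(f"DMR-YOLO relative gain: {final_relative}%")
```

With these inputs the full model gains 2.8 percentage points over the baseline, which corresponds to a relative improvement of about 3.24%.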
4.5. Model Comparison and Visualization Analysis
To further verify the superiority of the proposed model in bearing surface condition detection, several representative object detection algorithms in the related field are selected for comparative evaluation, including YOLOv3-tiny, YOLOv5n, YOLOv7, YOLOv8n, YOLOv9t, and YOLOv10n. The performance of each method is evaluated using multiple metrics, namely Precision (P), Recall (R), mean Average Precision (mAP), Frames Per Second (FPS), number of parameters (Parameters/10^6), and computational complexity (GFLOPs). The comparative results are summarized in Table 4.
Table 4. Performance Comparison of YOLO Series Models.
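For reference, the metrics listed above follow the standard definitions used in object detection. The sketch below expresses them in executable form, assuming that true-positive, false-positive, and false-negative counts are already available after matching predictions to ground truth. The function names and the example counts are hypothetical and are not taken from the experiments.

```python
def precision(tp, fp):
    """Fraction of predicted defects that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth defects that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def frames_per_second(num_images, total_seconds):
    """Throughput: images processed per second of inference time."""
    return num_images / total_seconds


def parameters_in_millions(num_parameters):
    """Parameter count expressed in units of 10^6."""
    return num_parameters / 1e6


# Hypothetical counts, for illustration only.
p = precision(tp=90, fp=10)                    # 0.9
r = recall(tp=90, fn=30)                       # 0.75
size = parameters_in_millions(3_600_000)       # 3.6
```

Precision penalizes false alarms while recall penalizes missed defects, so the two are reported together with mAP to characterize detection quality, and FPS, parameter count, and GFLOPs to characterize efficiency.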
As shown in Table 4, there are significant differences among the compared YOLO-based models in terms of precision, recall, mean average precision, detection speed, parameter size, and computational complexity. Although YOLOv3-tiny achieves relatively high precision and recall with a competitive mAP, its inference speed is relatively low and the computational cost remains high. In contrast, YOLOv5n substantially reduces computational complexity and significantly improves detection speed; however, this improvement comes at the cost of a noticeable degradation in both mAP and recall. As the baseline model in this study, YOLOv8n demonstrates a favorable balance between accuracy and efficiency. It achieves strong precision and recall performance, with an mAP of 86.5%, while maintaining relatively low computational complexity and fast inference speed. By comparison, YOLOv9t suffers from a considerable decrease in detection speed, limiting its suitability for real-time industrial applications. Although YOLOv10n is advantageous in terms of model compactness, its detection accuracy is unsatisfactory, making it less competitive in precision-critical defect detection tasks. Overall, the experimental results indicate that the proposed DMR-YOLO consistently outperforms the other YOLO-series models. It achieves the highest precision, recall, and mAP (89.3%), while maintaining high inference speed and a relatively small number of parameters. This demonstrates that the proposed improvements effectively enhance detection performance without sacrificing real-time capability.
After 200 training epochs, the improved DMR-YOLO model exhibits clear advantages across all key evaluation metrics. Its precision, recall, and mean average precision surpass those of the compared models, confirming its superior performance and adaptability in practical bearing surface defect detection scenarios. To further visualize and quantitatively compare the average precision of different models across defect categories, Figure 8 (Comparison of AP values of different algorithms in each defect category) presents a per-class AP chart, highlighting the effectiveness of the proposed method in recognizing specific defect types.
Figure 8 provides an intuitive comparison of the average precision (AP) of different YOLO-series algorithms and the proposed DMR-YOLO model across eight categories of bearing surface defects. Overall, all models perform exceptionally well on categories such as Casting burr, Polished casting, and Burr, with AP values approaching or exceeding 0.99. This indicates that these defects have relatively distinct visual characteristics, making them easier to detect. However, for categories including Crack, Pit, Scratch, Strain, and Unpolished casting, significant differences in performance among the models are observed. Notably, the proposed DMR-YOLO achieves the highest AP values in most of these challenging categories, with particularly marked improvements in Crack, Scratch, and Strain. These results demonstrate that the introduced enhancements effectively strengthen the model’s ability to recognize subtle cracks, scratches, and deformations that are typically difficult to detect. This figure, therefore, validates at the category level that DMR-YOLO maintains high robustness while achieving superior sensitivity and generalization for hard-to-detect defects.
To visually corroborate the quantitative performance advantages presented in Table 4 and Figure 8, and to specifically illustrate the improvement in defect recognition in practical scenarios, representative samples containing typical hard-to-detect defects were selected for qualitative comparison. As shown in Figure 9, detection results from the baseline YOLOv8n model and the proposed DMR-YOLO model are displayed alongside the original images. It is clearly observed that the improved model significantly reduces missed detections and false positives, while enhancing both localization accuracy and classification confidence. These instance-level visualizations further confirm the enhanced practical applicability and reliability of the proposed DMR-YOLO model in real-world industrial defect detection.
To visually demonstrate the detection performance of the DMR-YOLO model on bearing surface defects, the confusion matrices of the proposed model and the baseline YOLOv8n model were compared, as shown in Figure 10. In the confusion matrices, the rows represent the ground-truth classes, while the columns correspond to the predicted classes. The main diagonal reflects the correct recognition rate for each defect category, whereas the off-diagonal elements indicate misclassification rates. From the comparison, it is evident that the main diagonal of the DMR-YOLO confusion matrix exhibits darker color intensities and higher numerical values than that of YOLOv8n, indicating a higher correct detection rate across all bearing defect categories. The improved model shows notable gains in detection accuracy, particularly for Scratch and Strain defects, where precision increases from 57.8% and 80.7% to 64.2% and 88.6%, respectively. Overall, the mean average precision (mAP) across all defect categories rises from 86.5% to 89.3%, an improvement of 2.8 percentage points.
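The reading of the confusion matrix described above can be sketched as follows, assuming the convention stated in the text (rows are ground-truth classes, columns are predicted classes). Each row of raw counts is divided by its total, so that the diagonal entry gives the correct recognition rate of that class. The counts below are hypothetical and do not reproduce Figure 10; the function names are introduced for illustration.

```python
def row_normalize(matrix):
    """Divide each row of counts by its total (rows = ground-truth classes)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def correct_recognition_rates(matrix, labels):
    """Diagonal of the row-normalized matrix, keyed by class label."""
    norm = row_normalize(matrix)
    return {label: norm[i][i] for i, label in enumerate(labels)}


def most_confused_with(matrix, labels, label):
    """Predicted class that absorbs most of the errors for a ground-truth class."""
    i = labels.index(label)
    errors = {labels[j]: c for j, c in enumerate(matrix[i]) if j != i}
    return max(errors, key=errors.get)


# Hypothetical counts: rows are ground truth, columns are predictions.
labels = ["Crack", "Scratch", "Strain"]
counts = [
    [45, 3, 2],    # ground-truth Crack
    [6, 32, 12],   # ground-truth Scratch
    [1, 4, 45],    # ground-truth Strain
]

rates = correct_recognition_rates(counts, labels)
confusion_for_scratch = most_confused_with(counts, labels, "Scratch")
```

In this toy example the Scratch row has the lowest diagonal value, and most of its errors fall in the Strain column, which is the kind of pattern the off-diagonal entries are meant to expose.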
To intuitively demonstrate the stability of the DMR-YOLO model in bearing surface defect detection, Figure 11 illustrates the epoch-wise progression of mAP@50-95 for both YOLOv8n and DMR-YOLO across 200 training epochs. DMR-YOLO exhibits a generally higher mAP@50-95 than the baseline, particularly after the initial 50 epochs, and shows a smoother convergence trajectory, suggesting improved stability. The persistent performance advantage indicates that the gains introduced by DMR-YOLO are reliable under identical experimental conditions. Overall, DMR-YOLO achieves a favorable balance of detection accuracy, stability, and computational efficiency, with a modest increase in model size (3.60 M parameters) and computational cost (10.6 GFLOPs) compared to YOLOv8n.
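The mAP@50-95 metric tracked in this comparison averages mAP over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter localization than mAP at a single threshold of 0.50. A minimal sketch of the two ingredients, IoU between axis-aligned boxes and the threshold averaging, is given below. The box coordinates and function names are hypothetical, and the matching of predictions to ground truth that precedes the AP computation is omitted.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(iou_value):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou_value >= t]


def map_50_95(map_per_threshold):
    """Average of the mAP values computed at each of the ten thresholds."""
    if len(map_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one mAP value per IoU threshold")
    return sum(map_per_threshold) / len(map_per_threshold)


# Hypothetical boxes: the prediction is shifted 2 px to the right.
ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)
overlap = iou(ground_truth, prediction)   # intersection 80, union 120
passed = thresholds_passed(overlap)
```

Here the shifted prediction has an IoU of about 0.67, so it counts as correct at the four loosest thresholds only, which illustrates why small localization errors lower mAP@50-95 more than mAP@50.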
6. Conclusions and Prospects
To improve both the accuracy and speed of surface defect detection for critical mechanical components while maintaining a lightweight model, this study proposes an enhanced defect detection model termed DMR-YOLO based on YOLOv8n. The proposed method aims to provide an efficient intelligent visual inspection solution for industrial component manufacturing, particularly for applications requiring high reliability such as aero-engine component inspection. Through systematic experimental design and analysis, the following conclusions are drawn:
- (1) Ablation experiments verify the effectiveness and complementary contributions of each integrated module. The C2f-DBB module incorporates the Diverse Branch Block structure into the backbone network, enhancing feature diversity through multi-branch convolutional representations and improving the extraction of fine-grained defect features. The C2f-MLCA module introduces a multi-level channel attention mechanism that adaptively emphasizes informative feature channels while suppressing background interference. In addition, the proposed ResidualADown module introduces residual connections into the downsampling stage, improving feature information preservation during spatial resolution reduction. Experimental results indicate that the coordinated combination of these components enables the final model to achieve an mAP of 89.3%, a 2.8 percentage-point improvement over the baseline YOLOv8n, while maintaining high inference efficiency. The model contains only 3.60 M parameters, demonstrating a favorable balance between detection accuracy and computational efficiency. Moreover, its relatively low parameter count and computational cost are favorable for potential edge-oriented deployment in industrial environments.
- (2) Comparative experiments with several mainstream detection models, including YOLOv3-tiny, YOLOv5n, YOLOv8n, YOLOv9t, and YOLOv10n, demonstrate that DMR-YOLO achieves the best overall performance in key evaluation metrics, including precision (86.15%), recall (93.93%), and mAP (89.3%). Particularly for challenging defect categories such as Crack, Scratch, and Strain, the model achieves noticeable improvements in AP values, highlighting its effectiveness for bearing surface defect detection tasks.
- (3) Generalization experiments conducted on a steel surface defect dataset further confirm the superiority of DMR-YOLO. Compared with the baseline model, it achieves higher precision, recall, and mAP with only a marginal increase in parameters and computational cost, demonstrating strong generalization capability across different industrial defect datasets.
In summary, DMR-YOLO constructs an efficient lightweight detection framework by integrating structural re-parameterization, channel attention mechanisms, and an improved downsampling strategy within the YOLOv8n architecture. The experimental results demonstrate that the systematic integration of these complementary components can effectively improve detection performance while maintaining computational efficiency. Nevertheless, several limitations remain. Under extremely complex backgrounds, strong noise interference, or very small defect targets, localization and classification performance may still fluctuate, occasionally leading to missed or false detections. In addition, the proposed method focuses on image-based defect detection and localization and does not quantify or grade defect severity (e.g., size, depth, other geometric dimensions, or acceptability criteria), which is important for engineering decision-making in aerospace applications and typically requires additional measurement and domain-specific standards. Therefore, the proposed method is intended to serve as an auxiliary tool in the inspection pipeline, providing efficient preliminary screening rather than final decision-making. Future work will focus on evaluating the proposed method on larger-scale industrial datasets and exploring deployment on practical edge computing platforms for intelligent inspection and quality monitoring of aero-engine components. In addition, lightweight optimization techniques such as network pruning and knowledge distillation will be investigated to further reduce computational complexity and enhance the robustness of the model for real-world industrial inspection applications, and the incorporation of geometric measurement and severity grading mechanisms will be considered.
The proposed framework demonstrates promising performance for intelligent defect detection in industrial inspection tasks. It is intended to serve as an auxiliary tool for visual inspection in aerospace manufacturing, rather than a complete solution for surface quality measurement or safety-critical evaluation.