1. Introduction
Steel strips are a core product in the iron and steel industry, widely used in fields such as automotive manufacturing, mechanical engineering, chemical equipment, and aerospace. Over the past decade, the production process technology of steel strips has achieved significant improvements, particularly in aspects of thickness, material quality, and shape [1, 2]. However, with the increasing quality requirements for steel strips in high-end industries, the occurrence of surface defects (such as scratches and inclusions) remains unavoidable, which seriously affects production efficiency and product quality [3]. Therefore, the detection of steel strip surface defects has become a key factor in ensuring product quality.
Currently, defect detection methods are mainly divided into three categories: traditional methods (e.g., manual sampling, infrared detection, and magnetic particle detection), machine vision-based detection, and deep learning-based detection technologies [4]. The limitation of traditional methods lies in their relatively large detection errors; although machine vision methods have achieved improvements, they still have shortcomings in multi-type defect classification and real-time performance [5]. Deep learning technologies, especially object detection technologies, have made significant progress in industrial applications, enhancing detection accuracy and efficiency [6, 7].
Kou et al. [8] applied the YOLOv3 algorithm to the thin steel surface defect image dataset NEU-DET, achieving a mean average precision (mAP) of 72.2%, confirming the suitability of YOLOv3 for thin steel surface defect detection. Cheng and Yu [9] proposed a RetinaNet algorithm combining attention mechanisms and adaptive spatial feature fusion modules, which effectively improved defect detection performance on thin steel surfaces. Li et al. [10] introduced an improved YOLOv5 network algorithm for small surface defects, incorporating a Convolutional Block Attention Module (CBAM) and optimizing the network architecture and loss function, achieving an mAP of 91.0% on a self-constructed industrial defect dataset. Furthermore, Chen et al. [11] proposed a fast thin steel surface defect detection network (DCAM-Net) based on deformable convolutions and attention mechanisms, significantly enhancing the network’s localization capability. This algorithm achieved an mAP of 82.6% on a self-constructed dataset, surpassing the baseline YOLOX by 7.3 percentage points in mAP while achieving a detection speed of 100.2 frames per second, greatly improving the efficiency of cold-rolled steel defect detection. Xing and Jia [12] designed a novel intersection-over-union (IoU) loss function, XIOU, to better detect thin steel surface defects. Wang et al. [13] addressed the issue of algorithm failure due to noise in cold-rolled steel surface defect images by designing a noise regularization strategy to enhance the robustness of the training algorithm. At this stage, researchers are primarily constrained by insufficient algorithmic accuracy. Therefore, enhancing feature extraction capabilities and improving loss functions are the mainstream directions.
Understanding subsurface material behavior under coupled thermal and chemical influences has been emphasized in related domains, such as chemo-mechanical studies of shale–brine interactions. These insights highlight the importance of capturing multi-physics interactions, which are equally critical when assessing soil thermal responses around underground cables and steel strip surfaces. Emerging AI-driven diagnostic frameworks, such as those applied in clay-soil corrosion monitoring, may also provide pathways for predictive soil–cable thermal assessments, offering further inspiration for advancing defect detection systems in manufacturing processes.
Although deep learning has made significant progress in industrial detection, the challenge arising from improved detection accuracy is the large model size, computational redundancy, and difficulty in deployment. Furthermore, in practical production settings, limitations in detection speed and computing resources remain a challenge: many deep learning algorithms impose a heavy computational burden when processing high-resolution images, leading to slower detection speeds that fail to meet real-time requirements. Therefore, how to enhance computational efficiency while ensuring high accuracy—particularly in resource-constrained environments—remains a critical issue. Recent research on steel strip surface defect detection has highlighted the crucial role of lightweight models in detecting small defects.
To address the limitations of existing steel strip surface defect detection networks regarding detection speed and computational resources, many researchers have adopted lightweight algorithm approaches. Cai Jianfeng et al. [
14] integrated MobileNet into the Mask R-CNN object detection framework, using MobileNet as the feature extraction backbone to effectively reduce the algorithm’s parameter count and computational load. Subsequently, Zhang [
15] improved YOLOv5 by adopting the lighter ShuffleNetv2 as the backbone network, reducing the algorithm’s complexity and achieving some advantages in detection speed. Qin et al. [
16] proposed EDDNet, a lightweight algorithm for steel strip surface defect detection, which utilizes EfficientNet as the feature extraction backbone to significantly reduce computational overhead. Xie Zhanghao et al. [
17] replaced traditional convolution layers with GhostNet, significantly alleviating the computational burden, and enhanced feature extraction capability by introducing the Coordinate Attention (CA) mechanism, expanding the algorithm’s receptive field. Zhou et al. [
18] proposed the YOLOv5sGCE lightweight algorithm for steel strip surface defect detection, incorporating Ghost modules and CA mechanisms to minimize the algorithm’s size and computational requirements without compromising detection accuracy. Yang et al. [
19] developed the improved CBAM-MobilenetV2-YOLOv5 algorithm, combining the MobilenetV2 module and Convolutional Block Attention Module (CBAM) to create a more lightweight defect detection algorithm for steel strip surfaces. In recent years, Yan Xin et al. [
20] improved the SSD algorithm by integrating deconvolution operators and Transformer-based multi-head attention modules, enhancing detection accuracy while reducing resource consumption. Lu et al. [
21] proposed the lightweight defect detection algorithm DCN-YOLO, combining lightweight convolution blocks (DSConv) and Efficient Channel Attention (ECA) mechanisms to achieve a more compact algorithm without affecting detection accuracy. Wang et al. [
22] introduced DAssd-Net, a lightweight algorithm for steel strip surface defect detection, employing multi-branch dilated convolutions and multi-domain perceptual detection heads to reduce algorithm size while slightly improving detection accuracy. Wang Chunmei et al. [
23] introduced the YOLOv8-VSC lightweight defect detection algorithm, utilizing the lightweight VanillaNet as the feature extraction backbone and incorporating the SPD module to reduce network layers while accelerating algorithm inference speed. Additionally, they employed the lightweight upsampling operator CARAFE to enhance the quality and richness of the fused features. Tie et al. [
24] proposed LSKA-YOLOv8, a lightweight defect detection model based on YOLOv8. The model incorporates KWConv to reduce computational complexity, BiFPN for better contextual information capture, and RFB to expand the receptive field. Additionally, the LSKAttention module improves target feature capture, boosting detection performance. Experiments showed improved accuracy with reduced parameters and computational cost, making it suitable for deployment on resource-limited devices.
Wang et al. [
25] proposed a lightweight steel surface defect detection method, which combines Efficient Feature Fusion and Dynamic Label Assignment mechanisms. This approach significantly reduces computational complexity while ensuring detection accuracy, thus achieving a balance between lightweight design and high performance. Zhu et al. [
26] proposed the LSwin Transformer for efficient steel surface defect detection. Key innovations include a convolutional embedding module, attention patch merging module, and a window shift strategy to improve feature interactions. They also combined CNN feature extraction with the Swin Transformer’s global dependency-building capability through a depth multilayer perceptron module. Ablation studies confirmed the model’s effectiveness, and transfer learning accelerated convergence, showing strong potential for steel surface defect detection. Alshawi et al. [
27], addressing the challenge of detecting small defects, proposed a detection framework that combines Dual Attention and Semantic Segmentation. By integrating detection and segmentation tasks, this framework compensates for the limitations of traditional detection algorithms in capturing fine textures. Chen et al. [
28] proposed an unsupervised anomaly detection model based on a Dual Autoencoder and Generative Adversarial Network (GAN). This model enables defect recognition under unsupervised conditions, reducing data labeling costs. Bui, N.-T. et al. [
29] developed a high-performance detector based on Attention and Segmentation Guidance. By effectively combining attention mechanisms and segmentation information, this detector achieves high precision and robustness in end-to-end detection. Steel surface defect detection has gradually evolved from traditional CNNs to a multi-path system that integrates lightweight models, attention enhancement, Transformer fusion, and unsupervised learning. The overall trend indicates that research is shifting from “high-precision detection” to “efficient, generalized, and adaptive detection.” Future efforts will focus on: efficient deployment at the edge; integration of anomaly detection and self-supervised learning; multi-scale feature adaptive learning; data augmentation; and interpretability analysis. The algorithm proposed in this paper aims to achieve efficient edge deployment of lightweight models.
These studies have advanced the lightweighting of computation-intensive object detection algorithms and offer valuable insights for this field. However, terminal detection devices used for steel strip surface inspection usually provide only a few GFLOPS of computing capacity and less than 8 GB of memory. In contrast, mainstream object detection algorithms (such as the YOLO series) generally require between 40 and 150 GFLOPs, and prediction on 1024 × 1024 images requires approximately 16 GB of memory. These requirements exceed the capacity of terminal detection devices and prevent large-scale object detection algorithms from running stably. To obtain a lightweight algorithm without sacrificing detection accuracy, this paper builds on YOLOv8n, a more recent member of the YOLO series, and proposes the Finite Element Method-YOLO (FEM-YOLO) network algorithm, as shown in
Figure 1. The main contributions of this paper include:
- (1)
A lightweight FeatureNet network is adopted as the backbone network for feature extraction. Through more efficient parameter sharing and feature extraction mechanisms, the complexity of the model is reduced.
- (2)
The C2f-Enhance module is designed by integrating EnhanceConv into the C2f module. It improves the representation of details and edges in feature maps and enhances detection accuracy without increasing the number of model parameters or the computational load.
- (3)
To further improve detection accuracy while achieving lightweight design, a lightweight shared convolution detection head (MSCD) is developed. By sharing weight parameters, the overall number of parameters is reduced.
- (4)
The adopted Focal_EIoU loss function combines the advantages of Focal-Loss and EIoU. It enhances the algorithm performance when processing challenging samples and simultaneously penalizes deviations in position and shape, improving localization accuracy.
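The parameter saving claimed for a shared convolution detection head in contribution (3) can be illustrated with a simple count. This is only a sketch under stated assumptions, not the MSCD design itself: the two-layer 3 × 3 stack, the channel width of 64, and the three feature scales are hypothetical values chosen for illustration, and all scales are assumed to already have the same channel count.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def head_params(channels: int, num_scales: int, shared: bool) -> int:
    """Parameters of a two-layer 3x3 conv stack applied at every feature scale.

    Hypothetical layout: each scale feeds `channels` feature maps into two
    3x3 convolutions. With shared=True the same two layers serve all scales,
    so their weights are counted only once.
    """
    stack = 2 * conv_params(channels, channels, 3)
    return stack if shared else stack * num_scales


if __name__ == "__main__":
    separate = head_params(channels=64, num_scales=3, shared=False)
    shared = head_params(channels=64, num_scales=3, shared=True)
    print(f"separate heads: {separate} parameters")
    print(f"shared head:    {shared} parameters")
```

Under these assumptions the shared stack needs one third of the parameters of three independent stacks; the actual saving in a real head depends on which layers are shared and on any per-scale layers that remain.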
3. Results
3.1. Dataset
This paper uses the public steel strip surface defect dataset from Northeastern University (NEU-DET), as shown in
Figure 7. This dataset contains six types of defects: Crazing (Cr), Inclusions (In), Patches (Pa), Pitted Surfaces (Ps), Rolled-in Scale (Rs), and Scratches (Sc). Each defect type has 300 images, totaling 1800 images, which are augmented to 3600 images through random transformations. The dataset is divided into a training set, validation set, and test set in an 8:1:1 ratio, with 2880, 360, and 360 images, respectively. The defect types exhibit different sizes, shapes, texture features, and uneven distributions. In particular, the images of Cr and Rs defects face problems of irrelevant noise and uneven distribution, while Ps, Pa, and Sc defects have low contrast issues. These factors can affect detection performance, leading to missed detections and reduced algorithm feasibility.
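The 8:1:1 division described above can be sketched as follows. The file names, the fixed random seed, and the function name `split_dataset` are hypothetical; the text does not state how the split was randomized.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names standing in for the 3600 augmented images.
    names = [f"img_{i:04d}.jpg" for i in range(3600)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))  # 2880 360 360
```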
3.2. Experimental Setup
During the training phase, this study utilized four GPUs (NVIDIA TITAN XP) and employed Python 3.8 as the programming language, with PyTorch 2.1.0 as the deep learning library and PyCharm 2023 as the integrated development environment. The experimental parameters are summarized in
Table 1. The input image size was set to 640 × 640, and data augmentation was performed using the Mosaic method. The initial learning rate was set to 0.01, with a cosine annealing strategy used to adjust the learning rate throughout training. The batch size was configured to 16, and the SGD optimizer was selected. Dropout regularization was applied, and the training was conducted over 300 epochs. Unless otherwise stated, all experiments in this study were performed using the above settings.
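For readers unfamiliar with cosine annealing, the schedule can be written as a short function. This is a minimal sketch: the initial rate of 0.01 and the 300 epochs come from the settings above, while the floor `lr_min` and the per-epoch (rather than per-iteration) update are assumptions.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 300,
              lr_init: float = 0.01, lr_min: float = 0.0001) -> float:
    """Cosine-annealed learning rate for a given 0-based epoch.

    lr_min is an assumed floor; the text only specifies lr_init = 0.01.
    """
    progress = epoch / max(total_epochs - 1, 1)  # 0.0 at start, 1.0 at end
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 299):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

The rate starts at `lr_init`, falls slowly at first, fastest around the midpoint, and flattens out as it approaches `lr_min` in the final epochs.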
3.3. Dataset Augmentation Methods
In steel strip defect detection, the performance of deep learning models heavily depends on a large and diverse training dataset. However, image data for steel strip defects is often limited, with insufficient samples to cover all possible defect scenarios. This scarcity of data restricts the model’s generalization ability, making it challenging to maintain effective detection performance, especially when dealing with complex or previously unseen defects. To address this challenge, employing data augmentation techniques to generate diverse training data has become a crucial strategy. This paper introduces a data augmentation method based on random image transformations to improve the robustness and generalization ability of the steel strip defect detection model. The method involves randomly applying several image transformation operations to generate augmented samples that differ from the original images while preserving the core defect information. These transformations include horizontal flipping, random rotation, scaling, translation, and random adjustments to brightness and contrast. Although generative models and similar methods also offer advantages, random transformations are a suitable choice for processing steel defect detection data due to their simplicity and efficiency. They can quickly increase the diversity of the data without introducing excessive computational overhead, making them especially well-suited for real-time applications. By applying these transformations, the diversity of the training data is significantly increased, allowing the model to better adapt to a wide range of steel strip defects.
(1) Horizontal Flipping.
Images are randomly flipped horizontally with a 50% probability. This operation generates symmetrical samples, simulating variations in defect orientation that may arise during the actual production process.
(2) Random Rotation and Scaling.
Images are randomly rotated between 0 and 90 degrees and scaled within a 50% range, with a 50% probability. This simulates variations in the angle and size of defects, enabling the model to detect defects at different scales and orientations.
(3) Translation Transformation.
The images are translated with a displacement ratio limited to 0.1. This transformation enhances the model’s robustness to changes in defect location.
(4) Brightness and Contrast Adjustment.
The brightness and contrast of images are randomly adjusted with a 20% probability. This color transformation helps the model maintain sensitivity to defects under varying lighting conditions, improving its adaptability in real-world industrial environments.
The Albumentations library is employed to implement these augmentation operations, with all transformations tailored to the practical scenarios of steel strip defects, ensuring that the transformed images retain the essential characteristics of the defects. Through this data augmentation approach, the extended dataset significantly enhances the performance of the model. Augmented data are generated using defects such as Inclusions (In) and Rolled-in Scale (Rs) as examples, as shown in
Figure 8.
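Geometric transformations only preserve the defect information if the bounding-box labels are transformed together with the image. The sketch below shows that bookkeeping for horizontal flipping and translation in plain Python; it is not the Albumentations pipeline used in this paper. It assumes boxes in normalized YOLO format `(xc, yc, w, h)`, and the function names are hypothetical. The same flip decision and offsets would have to be applied to the image itself.

```python
import random


def hflip_box(box):
    """Mirror a normalized YOLO box (xc, yc, w, h) about the vertical axis."""
    xc, yc, w, h = box
    return (1.0 - xc, yc, w, h)


def translate_box(box, dx, dy):
    """Shift a normalized box by (dx, dy) and clip it to the image.

    Returns None when the shifted box falls entirely outside the image.
    """
    xc, yc, w, h = box
    x1 = min(max(xc - w / 2 + dx, 0.0), 1.0)
    y1 = min(max(yc - h / 2 + dy, 0.0), 1.0)
    x2 = min(max(xc + w / 2 + dx, 0.0), 1.0)
    y2 = min(max(yc + h / 2 + dy, 0.0), 1.0)
    if x2 <= x1 or y2 <= y1:
        return None
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def augment_boxes(boxes, rng, flip_p=0.5, shift_limit=0.1):
    """Apply one random flip decision and one random shift to all boxes."""
    flip = rng.random() < flip_p
    dx = rng.uniform(-shift_limit, shift_limit)
    dy = rng.uniform(-shift_limit, shift_limit)
    out = []
    for box in boxes:
        if flip:
            box = hflip_box(box)
        box = translate_box(box, dx, dy)
        if box is not None:  # drop boxes pushed out of the frame
            out.append(box)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment_boxes([(0.25, 0.5, 0.2, 0.4)], rng))
```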
3.4. Evaluation Metrics
This paper uses four metrics to evaluate the algorithm: mAP (mean Average Precision), FPS (Frames Per Second), parameter count (Params), and computational complexity (GFLOPs). mAP is the mean, over all defect categories, of the per-class Average Precision, where each Average Precision is the area under that class's precision-recall curve on the test set; a higher mAP indicates better detection performance across categories. FPS measures the number of images processed per second; a higher FPS means stronger real-time performance. Params measures the size of the model; a smaller parameter count indicates a more lightweight algorithm. GFLOPs measures the number of floating-point operations (in billions) required for one forward pass; a lower value indicates a lower computational cost.
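To make the mAP definition concrete, the sketch below computes Average Precision from a precision-recall curve using all-point interpolation and then averages it over classes. The interpolation scheme, the function names, and the example values are assumptions for illustration; the text does not state which AP variant or IoU threshold was used.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in increasing order, with matching precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the maximum precision at any higher
    # recall (the precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over each recall increment.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP: the unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical two-point PR curve, not data from this paper.
    ap = average_precision([0.5, 1.0], [1.0, 0.5])
    print(f"AP = {ap:.2f}")  # AP = 0.75
    print(f"mAP = {mean_average_precision({'Cr': 0.5, 'In': 0.7}):.2f}")
```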
3.5. Comparative Experiments
3.5.1. Comparison of Different IoU Loss Functions
To verify the effectiveness of the proposed Focal_EIoU loss function, this paper compares the algorithm with the default CIoU of the baseline network and other loss functions (including EIoU, GIoU, and DIoU). The comparison results are shown in
Table 2, which presents the classification performance of different loss functions for six types of steel strip surface defects. Focal_EIoU exhibits relatively balanced performance, with an mAP of 83.3%. The default CIoU of the baseline network performs poorly in detecting Cr and Rs defects, with mAP values of only 63.9% and 68.9%, respectively, and also shows unsatisfactory performance in detecting other defects. GIoU achieves an mAP of 84.1% in detecting In defects but has low detection rates for Cr and Sc, which are 57.3% and 94.7%, respectively, indicating that GIoU fails to provide balanced detection across different defect types. DIoU shows a significant improvement in detecting Ps defects, with an mAP of 87.6%, and performs best in detecting Pa defects, reaching 90.4%. However, it does not solve the problem of detecting Cr and Rs defects and fails to provide balanced performance across all defect categories. The network using Focal_EIoU in this paper achieves the best overall performance.
With an mAP of 83.3%, it effectively addresses the shortcomings of CIoU in Cr and Rs defect detection while maintaining relatively balanced performance for other defects. The mAP decline when using SIoU may be related to its design, as it focuses more on standardized object detection. In contrast, steel strip defects typically exhibit characteristics of being small and slender and having diverse morphological variations. In this case, the geometric constraint mechanism of SIoU may have insufficient sensitivity to small objects, resulting in inadequate localization and recognition capabilities for tiny defects. The Focal_EIoU loss function adopted in this paper combines the advantages of Focal Loss and EIoU, enhancing the algorithm’s performance when dealing with challenging samples while penalizing deviations in position and shape to improve localization accuracy. Therefore, Focal_EIoU is suitable for the scenario of steel strip surface defect detection.
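The structure of the loss can be sketched for a single pair of boxes. This follows one commonly published formulation, in which the EIoU loss (IoU term plus centre-distance, width, and height penalties) is weighted by IoU raised to a power γ; whether the implementation in this paper matches it exactly is an assumption, as is the value `gamma=0.5`.

```python
def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-9):
    """Focal-EIoU loss for two boxes given as (x1, y1, x2, y2).

    gamma=0.5 is an assumed default, not a value taken from the text.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both inputs.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Position penalty: squared centre distance over squared enclosing diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Shape penalties: width and height are penalized separately.
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    eiou = 1.0 - iou + center_term + width_term + height_term
    return (iou ** gamma) * eiou


if __name__ == "__main__":
    print(focal_eiou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

The three penalty terms correspond to the position and shape deviations mentioned above: unlike CIoU, which couples width and height through an aspect-ratio term, EIoU penalizes each dimension directly.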
Through comparative experiments with various loss functions, we observed that the recognition accuracy for Cr defects is significantly lower than the average accuracy for the other five defect types. This issue poses a major challenge in steel strip defect detection. The visual features of Cr defects are less distinct compared to those of other defect categories, which may hinder the model’s ability to effectively distinguish Cr defects from the background or other normal areas, leading to decreased recognition accuracy. Furthermore, under experimental conditions involving specific noise, background interference, or uneven lighting, the detection of Cr defects becomes even more difficult. We attempted to enhance the representation of Cr defects, but this led to a new issue of class imbalance, where the proportion of Cr defects in the training data became disproportionately large. This, in turn, degraded the model’s ability to recognize other defect categories. Future work on Cr defect detection should focus on targeted algorithm improvements, such as incorporating domain-specific knowledge related to defects, implementing multi-scale analysis, or employing advanced data augmentation techniques, in order to further improve the model’s ability to detect these challenging cases.
3.5.2. Comparison of Different Defect Detection Algorithms
The proposed FEM-YOLO algorithm achieves the best overall balance of mAP, parameter count, computational load, and FPS for steel strip surface defect detection. Compared with the baseline YOLOv8, FEM-YOLO increases the mAP by 8.4 percentage points (from 74.9% to 83.3%), reduces the parameter count by 1.8 M, and decreases the computational load from 8.7 GFLOPs to 4.6 GFLOPs (a reduction of 4.1 GFLOPs). The results of comparative experiments with other State-of-the-Art (SOTA) algorithms are shown in
Table 3. FEM-YOLO significantly outperforms FastRCNN and SSD in terms of mAP, algorithm parameter count, and computational load. In the FPS performance test, FEM-YOLO also outperforms FastRCNN, SSD, YOLOv3, YOLOv5, YOLOv9, YOLOv10, and YOLOv11. Although the FPS performance of FEM-YOLO is slightly lower than that of YOLOv6 and YOLOv8, it still exhibits superior performance over YOLOv6 and YOLOv8 in terms of mAP, algorithm parameter count, and computational load.
Although YOLOv3 achieves an mAP of 79.9%, its algorithm parameter count and computational load are much higher than those of other YOLO series networks, thus failing to meet the lightweight standards. Furthermore, compared with the latest YOLOv11, FEM-YOLO shows better performance in model accuracy, parameter count, and computational volume.
Table 3 also presents the results of comparative experiments with algorithms proposed in references [11, 23, 30, 31, 32, 33, 34], all of which are the latest steel strip surface defect detection algorithms developed in the past two years. These algorithms are all applied to steel strip surface defect detection and use the NEU-DET dataset. Because some references do not specify the hardware used, FPS comparison was not conducted. The DCAM-Net algorithm [
11] has an mAP of 82.6% and a parameter count of 31.0 M. The YOLOv8-VSC algorithm [
23] has an mAP of 80.8%, a parameter count of 1.96 M, and a computational cost of 6.0 G. The LF-YOLO algorithm [
31] has an mAP of 83.7% and a computational cost of 88.7 G. The MCB-FAH-YOLOv8 algorithm [
32] has an mAP of 81.8% and a parameter count of 6.06 M. The improved YOLOv8 algorithm [
33] has an mAP of 81.1%, a parameter count of 4.9 M, and a computational volume of 9.2 G. The EHH-YOLOv8s algorithm [
34] has an mAP of 79.7%, a parameter count of 7.0 M, and a computational cost of 15.6 G. Although the mAP of the LF-YOLO algorithm is 0.4% higher than that of FEM-YOLO, its computational cost is 88.7 G, which is much higher than the 4.6 G of FEM-YOLO. The FEM-YOLO algorithm designed in this paper is more lightweight compared with other steel strip defect detection algorithms. In summary, the proposed FEM-YOLO has the advantages of small algorithm size and low computational load while ensuring detection accuracy.
Compared to algorithms such as YOLOv8-VSC, the proposed method not only compresses the parameter count further but also improves recognition accuracy while maintaining high computational speed. The introduced FeatureNet, C2f-Enhance, and MSCD modules bring fundamental innovations across three levels. FeatureNet expands the representation dimension through non-linear high-dimensional mapping (EWT), redefining the balance between efficiency and expressiveness. C2f-Enhance extends the feature domain from intensity space to gradient space, enhancing the perception of defect details. MSCD restructures the information flow through shared multi-scale convolutions, enabling minimal redundancy and inter-head collaboration. Together, these mechanisms establish a unified design philosophy that emphasizes representation richness, structural efficiency, and inter-module collaboration.
Table 3.
Comparison of different strip steel detection algorithms on NEU-DET dataset.
| Model | mAP (%) | Param (M) | FLOPs (G) | FPS |
|---|---|---|---|---|
| FastRCNN | 75.9 | 137.1 | 370.2 | 10 |
| SSD | 73.2 | 26.3 | 62.7 | 22 |
| YOLOv3 | 79.9 | 61.5 | 105 | 25 |
| YOLOv5 | 73.5 | 2.5 | 7.2 | 213 |
| YOLOv6 | 74.3 | 4.2 | 11.9 | 293 |
| YOLOv8 | 74.9 | 3.2 | 8.7 | 302 |
| YOLOv9 | 78.8 | 7.2 | 26.7 | 88.7 |
| YOLOv10 | 71.9 | 2.3 | 6.7 | 231 |
| YOLOv11 | 80.9 | 2.6 | 6.6 | 242 |
| [30] | 83.7 | - | 88.7 | - |
| [23] | 80.8 | 1.96 | 6.0 | - |
| [32] | 81.8 | 6.06 | - | - |
| [11] | 82.6 | 31.0 | - | - |
| [33] | 81.1 | 4.9 | 9.2 | - |
| [34] | 79.7 | 7.0 | 15.6 | - |
| Ours | 83.3 | 1.4 | 4.6 | 265 |
3.5.3. Visual Comparison Experiment of Steel Strip Surface Defect Detection
In this paper, visual comparison experiments were conducted using the optimal weights of the trained YOLO series steel strip defect detection algorithms, and the results are shown in
Figure 9. In the figure, the blue, white, dark blue, cyan, light blue, and purple boxes represent the detection results of Cr, Pa, Rs, Ps, In, and Sc, respectively. The detection results show that YOLOv3, YOLOv5, YOLOv6, YOLOv8, YOLOv9, and YOLOv10 all perform well in detecting Pa, Rs, In, and Sc defects. However, except for the method proposed in this paper, other methods fail to completely detect Cr defects or have low confidence in detection results. This problem stems from the low contrast in surface defect images—Cr exhibits complex texture features and significant irrelevant noise, making it difficult for the detection network to accurately capture its position, which in turn leads to missed detections. FEM-YOLO effectively improves the detection capability for such defects, enabling more accurate capture of key features in images with rich details, thereby enhancing the accuracy of object detection and better capturing Cr defects. In addition, when detecting Ps defects, networks such as YOLOv5, YOLOv6, YOLOv8, YOLOv9, and YOLOv10 face difficulties in accurately localizing defects, resulting in defect merging or missed detections. This limitation is attributed to the insufficient overall feature extraction capability of these networks when dealing with irregular defects. The FEM-YOLO algorithm proposed in this paper enhances the correlation between the target features of irregular steel strip defects, enabling the detection network to obtain more comprehensive defect feature information and improving the feature extraction capability.
3.6. Ablation Experiments
To validate the effectiveness of FEM-YOLO, YOLOv8n was adopted as the baseline network, and progressive performance evaluations were conducted for each improvement, including data augmentation, replacement of the backbone with FeatureNet, incorporation of the improved C2f-Enhance module, and integration of the Multi-path Shared Convolutional Detection (MSCD) module. The results of the ablation experiments for each enhancement module are presented in
Table 4. As shown in
Table 4, the first enhancement involves data augmentation. The application of a Random Transformation (RT) method improves the feature representation of the augmented defect images, enabling the algorithm to learn defect characteristics more effectively. Consequently, the mean Average Precision (mAP) increased from 74.9% (baseline YOLOv8n) to 77.2%. Building upon this, FeatureNet was introduced to replace CSPDarkNet as the backbone network. With the use of FeatureConv, richer and more expressive feature representations were obtained without requiring complex network designs or additional computational costs. This modification made the overall network more compact and efficient: the floating-point operations (FLOPs) decreased from 8.7 G to 6.5 G, the number of parameters was reduced by 1 M, and the FPS increased by 16. Furthermore, the improved C2f-Enhance module, equipped with EnhanceConv, enhanced the algorithm’s average precision by 1.5%. However, a decrease in FPS from 318 to 278 was observed. This reduction is attributed to the increased computational time required by the C2f-Enhance module, which strengthens the model’s feature extraction capability and improves detection accuracy and feature expressiveness. Despite the decline in FPS, this precision improvement provides a significant advantage in practical steel defect detection scenarios. Finally, the designed Multi-path Shared Convolutional Detection (MSCD) head further reduced the model size, lowering the FLOPs to 4.6 G and the number of parameters to 1.4 M while increasing the mAP to 79.2%.
The results of the ablation experiments clearly indicate that the improved FEM-YOLO algorithm outperforms the YOLOv8n algorithm. Specifically, FEM-YOLO achieves a 4.2% improvement in mAP while reducing the computational load and parameter count to 52.9% and 43.7% of the baseline algorithm, respectively (4.6 G versus 8.7 G FLOPs and 1.4 M versus 3.2 M parameters), at the cost of 37 FPS. Despite this decrease, FEM-YOLO maintains a high processing speed of approximately 265 FPS, without significantly compromising detection accuracy or requiring additional computational resources. In other words, our method successfully strikes a balance between high real-time throughput, accuracy, and efficient resource utilization. Overall, the performance of the FEM-YOLO algorithm surpasses that of YOLOv8n.
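The relative costs can be checked directly from the figures reported for the baseline and for FEM-YOLO (FLOPs, parameters, and FPS as listed in Table 3). The dictionary layout and the helper name `ratio_percent` are illustrative only.

```python
# Values as reported in Table 3 for YOLOv8 (baseline) and FEM-YOLO (ours).
baseline = {"flops_g": 8.7, "params_m": 3.2, "fps": 302}
fem_yolo = {"flops_g": 4.6, "params_m": 1.4, "fps": 265}


def ratio_percent(new: float, old: float) -> float:
    """Express `new` as a percentage of `old`."""
    return 100.0 * new / old


if __name__ == "__main__":
    flops = ratio_percent(fem_yolo["flops_g"], baseline["flops_g"])
    params = ratio_percent(fem_yolo["params_m"], baseline["params_m"])
    print(f"FLOPs:  {flops:.2f}% of baseline")
    print(f"Params: {params:.2f}% of baseline")
    print(f"FPS drop: {baseline['fps'] - fem_yolo['fps']}")
```

The computational load works out to roughly 52.9% of the baseline and the parameter count to roughly 43.75%, with a drop of 37 FPS.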
3.7. Algorithm Feasibility Verification
To further verify the feasibility of the proposed FEM-YOLO algorithm, this paper compares the performance of YOLOv8n and FEM-YOLO on the Aluminum Surface Defect Dataset (APSPC) and the GC10-DET dataset. The APSPC dataset contains 1885 aluminum surface defect images, divided into ten defect categories: Dents, Non-conductive Areas, Scratches, Orange Peel, Bottom Leaks, Impacts, Pits, Protruding Powder, Coating Cracks, and Stains. The GC10-DET dataset is a real-world surface defect dataset derived from industrial environments, including ten types of surface defects: Punching (Pu), Welding Seam (Wl), Crescent Gap (Cg), Water Stains (Ws), Oil Stains (Os), Silk Spots (Ss), Inclusions (In), Rolling Pits (Rp), Creases (Cr), and Waist Creases (Wf). All defects appear on the steel plate surface, and this dataset contains 3570 grayscale images. Both datasets are divided into a training set, validation set, and test set in an 8:1:1 ratio. The experimental setup is consistent with the above, and the experimental results are shown in
Figure 10 and
Figure 11.
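The 8:1:1 partition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the text does not say whether the split was random or stratified by class, so a seeded random shuffle is assumed here, and the file names and the function `split_dataset` are hypothetical.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/val/test lists by integer ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # any remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 3570 GC10-DET images.
images = [f"img_{i:04d}.jpg" for i in range(3570)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 2856 357 357
```

Integer arithmetic is used for the subset sizes to avoid floating-point rounding; when the image count is not divisible by ten (as with the 1885 APSPC images), the leftover image is assigned to the test set.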
3.7.1. Feasibility Study on the APSPC Dataset
The results obtained after training the YOLOv8n and FEM-YOLO models on the APSPC dataset are shown in
Figure 10. The left panel shows the PR (Precision-Recall) curve of the trained YOLOv8n model, and the right panel shows the PR curve of the trained FEM-YOLO model. On the APSPC dataset, the baseline algorithm reaches an mAP of 56.4%, while FEM-YOLO reaches 58.4%. The proposed FEM-YOLO thus increases the mAP by 2.0 percentage points over the baseline, which supports the feasibility of the FEM-YOLO algorithm on this dataset.
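For readers less familiar with how an mAP value relates to a PR curve, the sketch below computes the average precision (AP) of one class as the area under its precision-recall curve and then averages the per-class APs. It uses all-point interpolation, which is an assumption: the evaluation code used in this paper may interpolate differently, and the sample points are invented for illustration only.

```python
def average_precision(recall, precision):
    """Area under a precision-recall curve using all-point interpolation.

    `recall` must be sorted in increasing order, with `precision` aligned to it.
    """
    # Pad so that the curve spans recall 0 to 1.
    r = [0.0] + list(recall) + [1.0]
    p = [0.0] + list(precision) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Illustrative PR points for two hypothetical classes (not data from the paper).
ap_a = average_precision([0.5, 1.0], [1.0, 0.5])
ap_b = average_precision([0.25, 0.5], [1.0, 1.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))  # 0.75 0.5 0.625
```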
3.7.2. Feasibility Study on the GC10-DET Dataset
The results obtained after training the baseline algorithm and the FEM-YOLO model on the GC10-DET dataset are shown in
Figure 11. The left panel shows the PR (Precision-Recall) curve of the trained YOLOv8n model, and the right panel shows the PR curve of the trained FEM-YOLO model. On the GC10-DET dataset, the baseline algorithm reaches an mAP of 67.7%, while FEM-YOLO reaches 72.2%. The proposed FEM-YOLO thus increases the mAP by 4.5 percentage points over the baseline, which again supports the feasibility of the FEM-YOLO algorithm.
These two sets of experiments indicate that the FEM-YOLO algorithm proposed in this paper is not limited to the NEU-DET dataset. Its good performance on NEU-DET is not accidental: the algorithm is also applicable to other datasets containing small defect types, and on both additional datasets it outperforms YOLOv8n.
3.8. Model Latency and System Energy Consumption
Model latency and system energy consumption are not only technical indicators but also key factors affecting user experience, operating costs, and system stability. Low latency ensures that user operations can receive timely feedback, improving user experience; low energy consumption means that the system consumes fewer hardware resources during operation, making it suitable for deployment on front-end devices with limited resources. By optimizing these two aspects, higher economic and social benefits can be achieved while improving system performance.
This paper uses the model inference time (Inference) as the evaluation criterion for model latency: the shorter the inference time, the better the latency performance of the model. As shown in
Figure 12, this paper compares the inference time of the YOLOv3, YOLOv5, YOLOv6, YOLOv8, YOLOv9, YOLOv10, YOLOv11, and FEM-YOLO algorithms on a single steel strip scratch defect image, with each detection result reported as a confidence score. YOLOv6 has the shortest inference time, 147 ms, and therefore the best latency performance, but its confidence in predicting the scratch defect is only moderate at 0.47. FEM-YOLO requires 158 ms, the shortest time among the remaining algorithms, and its confidence is relatively high at 0.75. Compared with the baseline network YOLOv8, whose inference time is 160 ms, FEM-YOLO shortens inference by 2 ms and increases the confidence by 0.35, which indicates that the FEM-YOLO algorithm has relatively good latency performance.
For system energy consumption, the lower a model’s parameter count (Params) and computational volume (GFLOPs), the easier it is to deploy on front-end hardware, the lower its demand for hardware computing resources, and the lower the system energy consumption. The FEM-YOLO algorithm in this paper has a parameter count of 1.4 M and a computational volume of 4.6 G, both of which are lower than those of the compared deep learning object detection algorithms.
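A per-image inference time such as those compared above is typically obtained by timing repeated calls to the detector. The sketch below shows one common way to do this in Python, assuming a generic `predict` callable; `measure_latency_ms` and `dummy_predict` are hypothetical names, and this is not the measurement procedure used by the authors, which the text does not describe.

```python
import time


def measure_latency_ms(predict, sample, warmup=3, runs=10):
    """Return the mean wall-clock time of predict(sample) in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs


def dummy_predict(image):
    """Placeholder standing in for a real detector's forward pass."""
    return sum(image)


latency = measure_latency_ms(dummy_predict, list(range(1000)))
print(f"{latency:.3f} ms per image")
```

When the model runs on a GPU, execution is usually asynchronous, so the device should be synchronized before each clock reading; otherwise the measured time may cover only the kernel launch.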
3.9. Hardware Deployment Comparison Experiment
To validate that our model can run with limited computational resources, we conducted a comparative experiment by deploying FEM-YOLO and YOLOv8n on both the Jetson Nano and the Jetson Nano NX. The results demonstrate that FEM-YOLO maintains strong performance even with limited computational resources. As shown in
Table 5, FEM-YOLO performs stably on the Jetson Nano, despite the hardware resource disparity compared to the Jetson Nano NX. Notably, there is minimal difference in recognition accuracy, power consumption, memory occupancy, and memory usage, all of which remain within acceptable limits. Additionally, it is worth mentioning that, under the same hardware resources, FEM-YOLO outperforms YOLOv8n in terms of overall performance.