4.2. Experimental Environment and Model Configuration
The experiments and model performance tests in this study were conducted on the Windows 11 operating system. The experimental environment was configured as follows: an RTX 3080 Ti GPU with 12 GB of VRAM, a 12-core Intel(R) Xeon(R) Silver 4214R vCPU @ 2.40 GHz, 30 GB of RAM, and CUDA version 11.0. The model parameters were set as follows: an input image size of 640 × 640 pixels, a batch size of 8, and 200 training epochs (including 50 frozen epochs to improve training efficiency). The SGD optimizer was used, with a momentum of 0.937 and a weight decay of 0.0005 to prevent overfitting. The initial learning rate was 0.01 and was dynamically adjusted using a cosine annealing schedule. To ensure fairness in the comparative experiments, both the ablation studies and model training used pre-trained weights from the VOC2007 + 2012 public datasets. The specific parameters are shown in Table 1.
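The cosine-annealed learning rate described above can be sketched in a few lines. Note that the decay floor `lr_min` is an assumption for illustration; the text only specifies the initial rate of 0.01 and the 200-epoch budget:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, lr_init=0.01, lr_min=1e-4):
    """Cosine annealing: starts at lr_init and decays smoothly to lr_min."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term
```

At epoch 0 this returns the initial rate of 0.01; halfway through training the rate has fallen to roughly half of that, and it reaches `lr_min` at the final epoch.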
The performance metrics used to evaluate the modified algorithm mainly include Mean Average Precision (mAP), model parameters (Params), and the floating-point operations (FLOPs) that represent the model’s complexity. These metrics are used to objectively assess the performance of the modified network.
Mean Average Precision (mAP) is derived from precision (P) and recall (R): the P–R curve is plotted, and the average precision (AP) for a category is the area under this curve. A larger area under the curve indicates higher detection accuracy for that category. The evaluation parameters are shown in Table 2. Some of the formulae for these metrics are presented in Equations (1)–(3).
In these equations, n denotes the number of defect categories to be detected.
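As a sketch of how AP follows from the P–R curve (Equations (1)–(3) are not reproduced here; this uses the common all-points interpolation, which may differ in detail from the paper's exact formulation):

```python
def average_precision(recalls, precisions):
    """AP as the area under the P-R curve (all-points interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Envelope: make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values over the n detected categories."""
    return sum(ap_per_class) / len(ap_per_class)
```

A detector that is perfectly precise at every recall level yields AP = 1.0 for that category; mAP then averages these per-category areas.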
Furthermore, the evaluation of the proposed model considers important indicators such as parameter count, computational complexity, and detection speed. These metrics provide a comprehensive assessment of the algorithm’s accuracy, recall rate, adaptability, and efficiency, serving as crucial evaluation metrics for improved defect detection performance.
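For reference, the Params and FLOPs figures reported for convolutional networks follow standard closed forms per layer; a minimal sketch for a single 2D convolution (bias included; whether a profiling tool counts one FLOP per multiply-accumulate or two varies by convention):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count: k*k*c_in weights per output channel, plus optional bias."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate count: one MAC per kernel weight per output position."""
    return k * k * c_in * c_out * h_out * w_out
```

For example, a 3 × 3 convolution from 3 to 16 channels contributes 448 parameters; summing such per-layer terms over the whole network yields the Params and FLOPs columns compared in the tables.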
4.3. The Comparative Experiment on the NEU-DET Dataset
In order to verify the feasibility of the CTG-YOLO model in industrial settings, this study selects several representative models for comparative experiments. These include YOLO-series algorithms (such as YOLOv3-tiny, YOLOv4, YOLOv7-tiny, YOLOv8s, YOLOv9s, and YOLOv10s), Faster R-CNN, SSD, and other improved models. The performance indicators on the NEU-DET dataset are summarized in Table 2. Comparison with these models further verifies the effectiveness and superiority of CTG-YOLO.
Table 2 comprehensively compares the detection accuracy, model complexity, and real-time performance of typical object detection models on the NEU-DET dataset. Experimental results show that the proposed algorithm achieves significant advantages in detecting crazing (Cr) and pitted surface (Ps) defects, with AP values reaching 42.09% and 90.35%, respectively, corresponding to improvements of 6.51% and 2.38% over the baseline YOLOv8s. Particularly for the low-contrast, complex-texture crazing defects, CTG-YOLO efficiently extracts fine-grained features through the CBY parallel structure and enhances global feature fusion via the TFF module, effectively improving detection accuracy and achieving a substantial breakthrough.
In terms of computational complexity, two-stage detection algorithms such as Faster R-CNN have significant shortcomings: its FLOPs reach as high as 402.02 G and its FPS is only 20.7, resulting in high resource consumption and poor real-time performance that make it unsuitable for high-speed online detection on production lines. Lightweight models such as YOLOv3-tiny and YOLOv7-tiny, while having relatively low FLOP counts of 12.9 G and 13.8 G, respectively, and high FPS, suffer a significant drop in detection accuracy, with mAP values of only 68.6% and 66.4%, failing to meet the accuracy requirements of steel surface defect detection. Popular single-stage models such as YOLOv9s and FCOS offer improved accuracy, but YOLOv9s has increased model complexity (38.7 G FLOPs), causing a noticeable drop in real-time performance (31 FPS), while FCOS has an even higher computational complexity of 50 G FLOPs and an mAP lower than that of CTG-YOLO.
In contrast, CTG-YOLO strikes a good balance between detection accuracy, computational complexity, and real-time performance. It achieves the highest mAP of 76.55%, outperforming all other compared models. It has fewer FLOPs (27.216 G) than the baseline YOLOv8s model (28.817 G) and most other mainstream models, resulting in lower computational costs. Its FPS of 122 is slightly higher than that of YOLOv8s and far exceeds those of YOLOv9s, Faster R-CNN, and other models, meeting the real-time detection requirements of industrial scenarios. Although its parameter count of 13.667 M is slightly higher than that of YOLOv8s (11.167 M), the significant improvement in detection accuracy and the optimized computational complexity make this increment reasonable for engineering applications. Additionally, CTG-YOLO maintains excellent detection accuracy for inclusion (In) and rolled-in scale (Rs) defects, making it suitable for practical engineering applications. Visualization results (Figure 9a–d) show that the proposed algorithm improves detection accuracy compared to other models, effectively detecting various steel surface defects with lower computational costs and high real-time performance.
4.4. Comparative Experiment on the GC10-DET Dataset
In order to further explore the effectiveness of CTG-YOLO, the detection performance test was carried out on the GC10-DET dataset, the results of which are shown in
Figure 10A–C. Compared with the NEU-DET steel surface defect dataset released by Northeastern University, the GC10-DET dataset covers a wider range of defect types and is closer to variable industrial scenarios. The defect categories are shown in Figure 10(Aa–Al). For comparison, detection algorithms previously evaluated on the GC10-DET dataset were mainly selected. However, since some models lack verification results on the GC10-DET dataset, several high-performing improved algorithms were also introduced for comparison. In the experiment, AP and mAP were used as evaluation indicators.
The comparison results are shown in
Table 3. Analysis of specific defect types shows that CTG-YOLO demonstrates excellent performance in the detection of inclusions (In) and creases (Cr), achieving AP values of 45.79% and 52.38%, which are 10.49% and 11.48% higher than those of the FPDNet algorithm, respectively. This indicates that CTG-YOLO enhances the multi-scale feature expression ability of the model. For the detection of punched holes (Pu), welds (Wl), crescent-shaped gaps (Cg), water spots (Ws), oil spots (Os), silk spots (Ss), and waist creases (Wf), although there are certain fluctuations, satisfactory results were also achieved. Overall, CTG-YOLO shows excellent comprehensive capability, proving the effectiveness and generalization ability of the improved model and providing strong evidence for the practicality of the improved modules.
Although the mAP of CTG-YOLO exceeds that of most algorithms and its detection result for inclusions (In) reached a notable 42.45%, CTG-YOLO appears to inherit a common weakness of YOLO-series algorithms: in the detection of rolled pits (Rp), its accuracy is only 20%. This is because such defects closely resemble the background and are easily treated as background during detection, posing a challenge for the model on this defect type. Although CTG-YOLO enhances the model’s multi-scale expression ability by introducing the CBY and TFF modules, these modules provide little benefit for image contrast information. Liu [35] enhanced image contrast through edge enhancement at the image input stage, and the detection accuracy for rolled pit (Rp) defects reached as high as 48.8%. However, the detection accuracy for other defect types decreased to a certain extent, revealing a drawback of edge enhancement for that improved algorithm. Therefore, future work will aim to enhance the fine-grained geometric information in the image while maintaining the model’s detection accuracy, providing substantial help for hard-to-detect defects such as rolled pits (Rp).
4.5. Ablation Experiment
In order to further demonstrate the influence of each improved module on the model, this paper conducts ablation experiments on the NEU-DET dataset with YOLOv8s as the base model to verify the applicability of each module. To accurately evaluate the performance and complexity of the model, AP, mAP, FLOPs, FPS, and the number of parameters are used as evaluation indicators for the ablation experiment. To display the ablation results more intuitively, the different improvement methods are abbreviated as follows:
The unmodified YOLOv8s model is referred to as the Baseline.
The combination of Baseline and ConvNext-C2f parallel structure is abbreviated as C-YOLO.
The combination of Baseline and the TFF-PANet neck structure is abbreviated as T-YOLO.
The combination of Baseline and the GSconv prediction head is abbreviated as G-YOLO.
The combination of Baseline, ConvNext-C2f and TFF-PANet is abbreviated as CT-YOLO.
The combination of Baseline, ConvNext-C2f and GSconv is abbreviated as CG-YOLO.
The combination of Baseline, TFF-PANet and GSconv is abbreviated as TG-YOLO.
The combination of Baseline, ConvNext-C2f, TFF-PANet and GSconv is abbreviated as CTG-YOLO.
The results of the ablation experiment (
Table 4) validate the effectiveness of the improved modules in terms of accuracy, lightweight design, and computational efficiency across the five evaluation dimensions (AP, mAP, FLOPs, FPS, and parameter count).
CBY Module: C-YOLO introduces the CBY parallel structure as the backbone module. The number of parameters increases to 13.227 M, and the computational complexity rises to 30.463 G. The mAP increases to 73.36%, with a detection speed of 118 FPS (2 FPS lower than the baseline), achieving a preliminary balance between accuracy and efficiency.
TFF Module: T-YOLO incorporates the TFF module in the neck network. The computational complexity increases to 33.092 G, and the detection speed drops to 115 FPS. However, the AP for Cr improves to 41.51%, and the mAP reaches 74.07%, demonstrating the value of multi-scale feature fusion.
GSConv Module: G-YOLO replaces the standard convolution in the prediction head with the lightweight GSConv module. The mAP rises to 74.60%, the number of parameters decreases to 9.196 M, and the computational complexity drops to 21.294 G. The detection speed increases to 132 FPS (12 FPS higher than the baseline), showcasing its advantage in terms of lightweight design.
Multiple-Module Combination: CT-YOLO integrates both the CBY and TFF modules, increasing the mAP to 75.97% and highlighting their combined advantages. TG-YOLO combines the TFF module and GSConv, achieving no loss in accuracy at a lower computational cost than the baseline. CTG-YOLO integrates all three modules, with 13.667 M parameters and a computational complexity of 27.216 G (lower than the baseline). The detection speed is 122 FPS (slightly higher than the baseline), and the mAP reaches 76.55%. Each module is effective on its own, and although some optimizations partially offset one another when combined, the overall effect is significant.
4.6. Class-Wise Error Analysis
To quantitatively reveal the error distribution and category confusion characteristics of the CTG-YOLO model in steel surface defect detection, this section systematically analyzes the sources of detection errors by combining the precision, recall, and F1 score for each defect category based on the general industrial defect detection matching standard (IoU threshold of 0.5). The results are shown in
Table 5 and
Table 6.
Specifically, precision reflects the proportion of samples predicted by the model as a specific defect category that are truly of that category, demonstrating the model’s ability to suppress false positives (FPs). Recall reflects the proportion of samples that truly belong to a specific defect category and are correctly detected by the model, demonstrating the model’s ability to suppress false negatives (FNs). The F1 score, as the harmonic mean of the two, comprehensively measures the model’s detection accuracy for that defect category. Through these metrics, the model’s detection performance for each defect category can be further quantified, clarifying its main error types.
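These three definitions reduce to a few lines of arithmetic over the per-category TP/FP/FN counts reported later; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-category precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # FP suppression
    recall = tp / (tp + fn) if tp + fn else 0.0     # FN suppression
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1
```

For instance, a category with 80 true positives, 20 false positives, and 20 false negatives scores 0.8 on all three metrics.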
4.6.1. Classification Performance Metrics for the NEU-DET Dataset
Based on the data in
Table 5, a systematic analysis of the model’s detection performance and core issues for each defect category can be performed.
Crazing (Cr, the most challenging category) exhibits an extreme combination of high precision and extremely low recall. Its precision of 72.73% indicates that the model classifies detected Cr cracks accurately (the proportion misclassified as other categories is low). However, the recall is only 11.11%, meaning that only about 33 out of 300 Cr samples were successfully detected; nearly 90% of the samples are missed due to low contrast, small size, and complex texture (errors dominated by false negatives, FNs). This directly results in an F1 score of only 19.00%, which becomes the critical bottleneck limiting the overall performance of the model.
Scratches (Sc, the best detected category): with a recall of 94.12%, precision of 87.27%, and F1 score of 91.00%, this category performs best on all metrics. This is because Sc scratches typically present clear linear textures with high contrast against the background and regular shapes; the model’s CBY parallel module and TFF-PANet can effectively capture their features, leading to extremely stable detection.
Rolled-in scale (Rs, moderate detection category) achieves a precision of 82.86%, recall of 39.19%, and F1 score of 53.00%, making it the worst-performing category apart from Cr cracks. The core issue lies in the low recall rate: Rs defects are often irregularly distributed scale imprints, with some areas showing high similarity to the background texture, causing many samples to be missed. Meanwhile, its precision is lower than that of the Pa, Ps, and In categories, indicating some confusion with the scratches category. The combination of these two factors leads to poor overall detection performance.
In summary, the model’s detection performance on the NEU-DET dataset shows significant differentiation: scratches, patches, and pitted surfaces perform excellently, inclusion shows moderate performance, and crazing and rolled-in scale defects are the main weaknesses. The extremely low recall rate of crazing is the core issue and requires targeted improvement by enhancing low-contrast defect feature signals and optimizing the multi-scale feature fusion strategy. Rolled-in scale defects, on the other hand, require a focus on improving recall while suppressing category confusion errors.
4.6.2. Classification Performance Metrics for the GC10-DET Dataset
The GC10-DET dataset covers 10 common steel surface defects in industrial scenarios, with samples that are more representative of the complexity in real production environments. The classification performance metrics are shown in
Table 6.
High-performance detection categories (F1 ≥ 80%) include Pu (punching), Wl (weld seam), Cg (crescent-shaped gaps), and Ws (water spots). The key advantage of these defects is their distinct morphological features and high contrast with the background. Among them, Pu defects achieve 100% precision, 93.94% recall, and an F1 score of 97%, performing the best. Punching defects are mostly regular circular or square holes with clear boundaries and unique features; the model’s CBY parallel module can fully capture their contour information, with almost no missed detections or false positives. Wl (weld seam) defects feature long, continuous stripes, and Cg (crescent-shaped gap) defects have a unique contour, so both are easily distinguishable. Although Cg’s precision is only 76.92%, its high recall rate of 95.24% means very few missed detections, meeting the industrial requirement of “prioritizing the prevention of missed key defects”.
Moderate detection categories (60% ≤ F1 < 80%) include Os (oil spots), Ss (silk spots), and Wf (waist creases). These defects share the characteristic of somewhat dispersed features: Os (oil spots) are mostly irregular and sheet-like, with some areas differing little in grayscale from the background, leading to a recall rate of only 56.60%. Ss (silk spots) are fine, scattered defects with weak feature signals, resulting in a recall rate of 53.75%. Wf (waist creases) have 100% precision (no false positives), but because some shallow creases overlap with the steel rolling texture, their recall rate is 63.64%, with an overall F1 score of 78%.
Low-performance detection categories (F1 < 60%) include Rp (rolled pits), Cr (creases), and In (inclusions), which represent the core weaknesses of the model’s detection and all exhibit a low recall rate. For Rp, a precision of 100% means that once the model classifies a defect as a rolled pit, it is always correct, with no category confusion, but its recall rate is only 20%: only about 46 out of 229 samples are successfully detected, with nearly 80% of the samples misclassified as background due to the shallow depth of rolled pits and their high similarity to background textures, making this one of the most challenging defect types in industrial scenarios. In has a precision of 60%, a recall rate of 13.04%, and an F1 score of 21%, all of which are the lowest. Inclusion defects are often small, scattered impurity points with very low contrast that are easily confused with noise on the steel surface, resulting in significant missed detections and some false positives, leading to the worst detection performance. For Cr, the recall rate of 14.29% is similar to that of In, with only about 14% of samples detected. This is mainly because some creases are “latent creases” with no obvious convex or concave features or grayscale differences; their feature signals are weak and difficult for the feature extraction module to capture.
In summary, the model’s detection performance on the GC10-DET dataset exhibits significant “category dependence”; defects with regular shapes and high contrast are detected effectively, while smaller, low-contrast defects with high similarity to the background perform poorly. Specifically, the “high precision, low recall” issue with Rp requires enhancement of low-contrast defect feature signals, while In’s “low precision, low recall” requires optimization of both category-specific feature extraction and noise suppression capabilities to meet the industrial requirements for defect detection.
4.6.3. Quantitative Evaluation Based on the Confusion Matrix
To quantitatively reveal the category confusion characteristics of the CTG-YOLO model in steel surface defect detection, this section adds a confusion matrix-based analysis on top of the original precision, recall, and F1 score. Using the commonly adopted IoU = 0.5 matching threshold in industrial defect detection, a confusion matrix for the CTG-YOLO model on the GC10-DET dataset is constructed (
Figure 11). The rows of the matrix represent the true defect categories, while the columns represent the categories predicted by the model. The values in the matrix correspond to the number of matched samples for each category, presenting the degree of confusion between categories and the background misclassification situation.
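A sketch of how such a matrix can be accumulated under the IoU = 0.5 matching rule (a simple greedy matching is shown here; the exact matching procedure used for Figure 11 is not specified in the text):

```python
IOU_THRESHOLD = 0.5  # matching threshold used throughout this analysis

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def confusion_matrix(gts, preds, num_classes):
    """Rows are true categories, columns are predicted categories; the extra
    last row/column represents background (false positives / missed detections).
    gts and preds are lists of (class_index, box) pairs."""
    bg = num_classes
    m = [[0] * (num_classes + 1) for _ in range(num_classes + 1)]
    used = set()
    for g_cls, g_box in gts:
        best_iou, best = 0.0, None
        for i, (_, p_box) in enumerate(preds):
            if i not in used and iou(g_box, p_box) > best_iou:
                best_iou, best = iou(g_box, p_box), i
        if best is not None and best_iou >= IOU_THRESHOLD:
            used.add(best)
            m[g_cls][preds[best][0]] += 1   # matched: true vs. predicted class
        else:
            m[g_cls][bg] += 1               # missed detection (FN)
    for i, (p_cls, _) in enumerate(preds):
        if i not in used:
            m[bg][p_cls] += 1               # background false positive
    return m
```

Normalizing each row then yields the diagonal percentages and background misclassification rates discussed in this section.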
The analysis results indicate that punching (Pu) and weld seam (Wl) defects exhibit good distinguishability, with diagonal percentages of 93.94% and 88.46%, respectively, and no obvious cross-category misclassifications. This is due to the regular morphological characteristics of these two defect types, which differ significantly from other defect categories, allowing the model to identify them accurately.
Among the main confusion pairs, there is a significant mix-up between rolled pits (Rp) and oil spots (Os), with 2.62% of Rp samples misclassified as Os; this is due to the similarity in grayscale features between the dark area at the bottom of rolled pits and oil-spot defects. The confusion between inclusions (In) and silk spots (Ss) is more prominent, with 3.06% of In samples misclassified as Ss, as both are fine, dispersed defects that are difficult to distinguish under low-contrast conditions. There is also confusion between crescent-shaped gaps (Cg) and water spots (Ws), with 3.06% of Cg samples misclassified as Ws, owing to the similarity between the blurry curved edges of crescent-shaped gaps and the texture of water spots.
A particularly notable issue is background misclassification: the background misclassification rates for inclusions (In), rolled pits (Rp), and creases (Cr) are as high as 86.90%, 79.91%, and 85.59%, respectively, all exceeding 75%, while the inter-category confusion rate is below 3.06%. Further data analysis confirms that the core bottleneck in detecting these defect types lies in insufficient feature extraction capability: the model primarily suffers from missed detections rather than inter-category misclassification.
4.6.4. Quantitative Counts of TP/FP/FN for All Categories
This section presents the TP, FP, and FN counts of the CTG-YOLO model for each category on the NEU-DET dataset, specifying the number of correctly detected instances and the scale of errors for each defect type. The results are shown in
Table 7.
Table 8 shows the FP/FN breakdown by category for the CTG-YOLO model on this dataset, along with key evaluation metrics. The proportion of failure cases and representative examples are analyzed as follows. Crazing (Cr) is dominated by missed detections: there are only 33 TP instances, while the number of FNs reaches 267, a missed detection rate of 89.00%. The features of this defect are highly fused with the steel rolling background, making it difficult for the CBY parallel module to capture its faint feature signals, so nearly 90% of the samples are misclassified as background. It is worth noting that there are only three misclassifications and four background FPs, indicating that the model’s ability to distinguish the Cr category from other categories and from the background is relatively strong. The core bottleneck lies in insufficient feature capture leading to missed detections, rather than category confusion or background misclassification.
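The missed-detection rate quoted above follows directly from the counts; as a quick check:

```python
def missed_detection_rate(tp, fn):
    """Fraction of ground-truth instances the model failed to detect: FN / (TP + FN)."""
    return fn / (tp + fn)

# Crazing (Cr) on NEU-DET: 33 TP and 267 FN give 267 / 300 = 0.89, i.e. 89.00%.
```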
Scratches (Sc) exhibit prominent background misclassification: there are 282 TP instances, with a recall rate of 94.00% and a missed detection rate of only 6.00%. However, there are 41 FP instances, of which 31 are background FPs. As a high-contrast linear defect, Sc is captured well by the model, resulting in a very low missed detection rate; however, the normal rolling texture on the steel surface resembles the shape of Sc defects, leading to a significant number of background misclassifications. The core issue lies in the insufficient ability to differentiate between defects and background linear textures.
In summary, the detection performance of the CTG-YOLO model on the NEU-DET dataset shows significant differentiation. Scratches have the highest detection rate but suffer from severe background misclassification. Patches, pitted surfaces, and inclusions show stable performance, while crazing and rolled-in scale are the main problem categories. The core problem with crazing lies in insufficient feature capture, leading to missed detections, while rolled-in scale faces the dual challenge of missed detections and category confusion. Future optimizations should focus on enhancing feature extraction capabilities and improving the model’s ability to differentiate background textures.