1. Introduction
Metal surface defect detection has been a critical issue in ensuring high-quality metal product manufacturing and 3D printing [
1,
2,
3,
4,
5], and surface defects are among the most important quality concerns [
6,
7,
8,
9]. Common metal surface defects include cracking, scratching, patching, and corrosion, which are frequently observed on the surfaces of metal objects [
10]. These defects often reduce the yield of metal product manufacturing [
11,
12,
13]. Defect detection is currently used in many industrial scenarios. For example, surface defect detection can be used to check for structural defects, such as dents and corrosion, on racks and pallets in warehouse management [
14] and surface cracks and roughness in 3D metal printing [
15]. It can be used to assess the surface-forming conditions of printed metal items, enabling the timely monitoring of product status.
In the field of surface defect detection, Zorić et al. [
16] extracted the intensity of each frequency component by computing the Fourier spectrum of defect images, thereby extracting features for classification. Ni et al. [
17] proposed a method based on edge partition features, utilizing adaptive partitioning and edge feature extraction to improve detection accuracy. Gyimah et al. [
18] employed complete local binary patterns to extract discriminative information from defects and perform classification. These traditional classification methods have achieved some success in industrial detection.
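As an illustration of the kind of hand-crafted texture feature used by such traditional methods, the following minimal sketch computes a basic 8-neighbour local binary pattern (LBP) histogram in plain Python. It is a simplified stand-in, not the complete LBP variant employed by Gyimah et al.; the function names `lbp_code` and `lbp_histogram` and the nested-list image format are assumptions made here for illustration.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code of the interior pixel (r, c) of a 2-D list."""
    center = img[r][c]
    # Neighbours are visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

The resulting histogram can serve as a fixed-length feature vector for a conventional classifier, which is the general pattern shared by the traditional methods described above.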
However, metal surface defect detection is a challenging task. Defects on metal surfaces often show significant scale variation, ranging from micro-cracks to large defect areas. This wide range of scales makes it difficult for a model to capture the fine details of small targets while preserving the overall structure of large targets [
19,
20]. In addition, defects usually have irregular shapes. This requires the detection model to have stronger multi-scale feature fusion capability [
21,
22]. In industrial environments, the contrast between defects and the background texture is sometimes low, which can lead to blurred boundaries [
23,
24]. When dealing with complex defects, traditional metal surface defect detection methods lack the ability to learn from new data. Their generalization ability is also limited [
25,
26,
27].
Deep learning-based methods, such as You Only Look Once (YOLO) [
28,
29,
30], R-CNN [
31,
32,
33], and transformers [
34,
35,
36], have achieved remarkable results in the field of metal surface defect detection. They can automatically extract features from data for learning, reducing manual effort, and exhibit strong adaptability across various defect detection scenarios, achieving excellent recognition rates [
37,
38,
39]. In addition, multi-scale feature fusion mechanisms, such as pyramid-based feature fusion methods, play an important role in improving object detection performance.
Figure 1 shows the development of metal surface defect detection in recent years. It covers four stages: traditional machine learning-based methods, Convolutional Neural Network (CNN)-based methods, Pyramid Network (PN)-based feature fusion, and transformer-based methods. It also highlights representative methods and summarizes the key development trends.
Although the above deep learning-based methods have achieved strong performance in practice, directly applying general detection models to metal surface defect scenarios raises several challenges. Small defects are easily missed, and low-contrast defects are difficult to identify. Therefore, it is necessary to compare and analyze the adaptability of different model structures in industrial defect scenarios. It is also important to systematically review recent improvements to general models for metal surface defect detection.
Several studies have summarized research in this field, such as surveys on steel surface defect detection and reviews of CNN-based defect detection methods [
40]. They provide useful references for understanding the development of metal defect detection techniques. However, the existing reviews have certain limitations. On the one hand, most of them focus on a single material scenario. They lack a comprehensive perspective across different types of metals and do not provide a sufficient comparison among them. On the other hand, some of them mainly concentrate on traditional CNN architectures or earlier object detection models. They do not analyze the structural evolution of recently developed detection frameworks. At the data level, the scale, categories, and application differences of public metal surface defect datasets directly affect model evaluation and algorithm design. However, systematic summaries of these datasets are lacking. Therefore, it is necessary to develop a review focused on metal surface defects and to provide a systematic analysis of deep learning methods for metal surface defect detection.
In response to the above issues, this paper focuses on key challenges in industrial scenarios and comprehensively reviews the research progress of deep learning in metal surface defect detection. Compared with existing surveys [
41], the main features and contributions of this review paper are as follows:
- 1.
It systematically introduces and compares CNN-based detection methods at different stages and transformer-related detection methods in the field of metal surface defect detection.
- 2.
It establishes a unified cross-architecture comparison perspective. It clarifies the differences among detection paradigms in terms of accuracy, real-time performance, and application scenarios. It also compares the performance of different paradigms on the NEU-DET dataset.
- 3.
It systematically summarizes the detailed information and detection applicability of commonly used public datasets in metal surface defect detection. It introduces evaluation metrics and analyzes the differences between academic and industrial data from the perspective of data distribution.
- 4.
It builds a structured challenge analysis framework for industrial scenarios. It reveals the core bottlenecks in metal surface defect detection, including multi-scale modeling, efficiency trade-offs, and cross-domain generalization. It also proposes hybrid architectures that combine lightweight design and emerging models, such as Mamba, as important directions for future technological development.
This review systematically summarizes the current research on deep learning methods for metal surface defect detection.
Section 1 outlines fundamental studies in surface defect detection.
Section 2 presents deep learning approaches, including single-stage and two-stage detection algorithms based on CNN, as well as transformer-based detection methods.
Section 3 describes PN architectures for image processing.
Section 4 introduces several commonly used datasets for metal surface defect detection.
Section 5 discusses evaluation metrics specific to this field.
Section 6 presents a comparative analysis of the performance of different detection models when applied to metal surface defect detection.
Section 7 explores potential future research directions in metal surface defect detection.
Section 8 concludes this review paper.
2. Deep Learning Algorithms
2.1. Single-Stage Detection: YOLO
The YOLO series has consistently been a popular network model in single-stage object detection, characterized by its strong real-time performance, fast detection speed, simple model architecture, and ease of deployment. It has been widely applied in numerous object detection tasks [
42,
43,
44]. The classic YOLO series consists of a four-part framework: the input layer preprocesses images; the backbone performs general feature extraction; the neck processes the features to enhance their diversity and robustness; and the output layer produces the object detection results.
Figure 2 illustrates a detailed schematic diagram of the four modules in a classic YOLO network:
- 1.
The Convolutional Block Layer (CBL) is composed of a convolutional layer, a batch normalization (BN) layer, and a Leaky ReLU activation.
- 2.
ResUnit is a residual combination of the CBL module.
- 3.
Cross Stage Partial 1_X (CSP1_X) is composed of the CBL modules, ResUnit modules, and a convolutional layer.
- 4.
Cross Stage Partial 2_X (CSP2_X) is formed by connecting a convolutional layer with multiple ResUnit modules.
With the continuous development of YOLO, there are now many more versions, e.g., the highly popular YOLOv8 series [
45]. Compared to previous versions, YOLOv8 further improves accuracy, featuring a richer gradient flow that enhances its feature learning capability. Additionally, it decouples the classification and box regression branches of the detection head, providing greater flexibility for modularization. However, for models of the same size class, YOLOv8 has more parameters than YOLOv5. Therefore, when deploying the model to resource-constrained devices, YOLOv5 is more suitable [
46].
YOLOv12 was released in February 2025. Compared with other YOLO-series networks, it incorporates the new FlashAttention mechanism. This addition enhances the model’s receptive field. Meanwhile, it further improves the inference speed with a minimal impact on model performance [
47]. YOLOv13 proposes the Full-Pipeline Aggregation-and-Distribution Paradigm, which effectively alleviates information flow bottlenecks by establishing key information pathways across the backbone, neck, and head. Compared with YOLOv12, YOLOv13 enhances global context modeling through hyper-node-based representation, thereby improving recall in densely packed scenarios [
48].
48]. As a single-stage detector, YOLO offers strong real-time performance compared to multi-stage detectors and requires fewer computational resources than models such as R-CNN or transformer-based detectors. This makes it suitable for production tasks with limited resources and high real-time requirements. Many studies have further adapted YOLO and its major versions to metal surface defect detection, and representative work is summarized here. Zhao et al. [
49] introduced a dual feature pyramid in the neck network of YOLOv5 and replaced the convolutional layers in the backbone with the Res2Net network to enlarge the receptive field. To handle the defects of varying sizes in defect detection, Huang et al. [
50] designed a deformable convolutional feature extraction module and incorporated a context enhancement mechanism into the YOLOv5 network to aggregate information within the network. Similarly, Fan et al. [
51] integrated a context integration module into the YOLO network and employed a genetic algorithm to cluster and fine-tune the anchors fed into YOLO, thereby improving the correlation between the dataset and the algorithm. Xie et al. [
52] used the K-means++ algorithm to optimize the quality of the initial anchors and introduced separable convolutions to extract features at different scales. Gao et al. [
53] used a PN architecture to effectively preserve shallow feature information when designing the YOLO network. Additionally, they introduced the Normalized Wasserstein Distance and Complete Intersection over Union (CIoU) as loss terms to improve the recognition of small targets. These improved YOLO models target multi-scale feature extraction and anchor adjustment. Zhao et al. [
49], Huang et al. [
50], and Gao et al. [
53] improved the multi-scale defect extraction capabilities of the defect detection network, enhancing the model’s ability to detect defects of different sizes. Fan et al. [
51] and Xie et al. [
52] optimized the network’s anchor selection, enabling the anchors to better adapt to the dataset’s target distribution. These methods have all improved the performance of the YOLO model in the field of metal surface defect detection and effectively enhanced the model’s applicability to complex defect scenarios.
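The anchor optimization described above can be sketched as clustering the (width, height) pairs of the ground-truth boxes. The following is a minimal illustration, assuming K-means++ seeding and the commonly used distance 1 − IoU between box shapes; it is not the exact procedure of Fan et al. or Xie et al., and the names `wh_iou` and `kmeanspp_anchors` are hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) pairs aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeanspp_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors using the distance 1 - IoU."""
    rng = random.Random(seed)
    # K-means++ seeding: each further centre is drawn with probability
    # proportional to the squared distance to the nearest chosen centre.
    anchors = [rng.choice(boxes)]
    while len(anchors) < k:
        d2 = [min(1.0 - wh_iou(b, a) for a in anchors) ** 2 for b in boxes]
        if sum(d2) == 0:
            raise ValueError("need at least k distinct box shapes")
        anchors.append(rng.choices(boxes, weights=d2, k=1)[0])
    for _ in range(iters):
        # Assignment step: each box joins the anchor with the highest IoU.
        groups = [[] for _ in anchors]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, anchors[i]))
            groups[best].append(b)
        # Update step: each anchor becomes the mean shape of its group.
        updated = []
        for group, old in zip(groups, anchors):
            if group:
                mean_w = sum(w for w, _ in group) / len(group)
                mean_h = sum(h for _, h in group) / len(group)
                updated.append((mean_w, mean_h))
            else:
                updated.append(old)  # an empty cluster keeps its centre
        if updated == anchors:
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when a dataset mixes very small and very large defects.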
Table 1 presents the information on the improved YOLO models. It can be observed that most YOLO-based improvements focus on two main directions: multi-scale feature extraction and anchor optimization. In YOLO models, detection performance can be improved by introducing an FPN and refining anchor box boundaries. These methods improve the detection of small targets and defects with scale variation. A clear performance trade-off can also be identified in these improvements. Multi-scale fusion and node enhancement can increase detection accuracy. However, they also lead to larger model size and higher computational complexity.
Overall, research on YOLO-based models for metal surface defect detection mainly aims to improve small target detection and multi-scale capability. At the same time, these studies seek to balance real-time performance and accuracy to meet the practical requirements of industrial online inspection.
2.2. Two-Stage Detection: R-CNN
Single-stage object detection algorithms simplify the process to a single step, directly predicting the location and class of objects from an input image. In contrast, two-stage algorithms first generate candidate regions and then perform classification and regression by focusing on those regions.
As shown in
Figure 3, Faster R-CNN is a classic two-stage algorithm [
54]. It is composed of four main modules: the Conv layers, the Region Proposal Network (RPN), Region of Interest (RoI) Pooling, and Classification.
The Conv layers extract feature maps from the input image and consist of a standard CNN backbone. The RPN generates multiple anchors and produces candidate regions, enabling the model to produce high-quality proposals while reducing computational costs. Additionally, this approach enhances the model’s scalability, making it easier to integrate with other networks. After the RPN, the network uses RoI Pooling to obtain feature maps of a fixed output size, followed by the Classification module to complete the classification task.
A common two-stage algorithm, Mask R-CNN, contains an RoI Align layer that uses bilinear interpolation to sample the feature map at non-quantized RoI locations, better preserving the spatial information of the RoI. However, it requires high-quality dense segmentation annotations, necessitating a large amount of data and a great number of parameters.
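The bilinear sampling used by RoI Align can be illustrated with a small pure-Python sketch. It assumes a single-channel feature map stored as nested lists, RoI coordinates already expressed in feature-map units, and one sample at the centre of each output bin (practical implementations usually average several samples per bin); `bilinear_sample` and `roi_align` are illustrative names, not a library API.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional location (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours always lie inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2):
    """Pool an RoI (y1, x1, y2, x2) to out_size x out_size without rounding."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [
            # One bilinear sample at the centre of each bin (an assumption).
            bilinear_sample(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]
```

Because the RoI boundaries and bin centres are never rounded to integer positions, the pooled values stay aligned with the original box, which is the property that distinguishes RoI Align from RoI Pooling.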
The R-CNN series typically maintains excellent recognition rates in surface defect detection. Furthermore, due to its more precise handling of bounding boxes, the R-CNN series also excels in recognizing small objects. In the field of metal surface defect detection, Shen et al. replaced the backbone of the Faster R-CNN network with a residual network to address the issue of gradient explosion and used the K-means++ algorithm to redefine and select the optimal anchor box sizes during detection [
55]. Ye et al. used the RoI Align algorithm in Faster R-CNN to eliminate positional deviations in candidate boxes and employed non-maximum suppression (NMS) to improve the model’s handling of overlapping bounding boxes [
56]. Wang et al. incorporated a dual-feature PN into Mask R-CNN to increase the amount of semantic information the model can extract. Additionally, they adopted a new evaluation metric, the Complete Intersection over Union, to improve the network’s judgment of candidate boxes [
57]. Song et al. introduced deformable convolution into Faster R-CNN to enable the model to dynamically identify defects and designed a Suppression Block module to suppress the background of defect features, thereby making the defect features more distinct [
58]. Similarly, Ye et al. integrated deformable convolutions into Faster R-CNN and introduced a balanced pyramid for path aggregation. For label assignment, this model achieved dynamic candidate box allocation through threshold control [
59]. In the above-mentioned improvements to R-CNN models, Shen et al. [
55] and Ye et al. [
56] optimized the anchor boxes of the R-CNN series to enhance the model’s adaptability to complex backgrounds. Song et al. [
58] and Ye et al. [
59], who proposed Robust Faster R-CNN, improved the network’s feature processing, for example by incorporating deformable convolutions to enhance the model’s recognition capability for defects of varying sizes. Additionally, Wang et al. [
57] improved the model’s evaluation criterion to optimize candidate-region localization. Collectively, these modifications further enhanced the R-CNN series’ performance in identifying metal surface defects.
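For reference, the Complete Intersection over Union (CIoU) mentioned above extends IoU with a centre-distance penalty and an aspect-ratio term. The sketch below follows the standard CIoU definition for axis-aligned boxes in (x1, y1, x2, y2) format; the function name `ciou` and the small `eps` stabilizer are choices made here, and the code is not taken from Wang et al.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centres.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```

When used for bounding box regression, the corresponding loss is typically `1 - ciou(pred, target)`, so that non-overlapping boxes still receive a gradient through the centre-distance term.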
Table 2 presents the information of the improved R-CNN models. It can be observed that most R-CNN-based improvements focus on region proposal optimization and adaptation to complex backgrounds. Studies enhance multi-scale anchor matching, refine anchor selection strategies, and introduce feature fusion mechanisms to improve localization accuracy under complex backgrounds and diverse defect shapes. These improvements strengthen the advantage of the R-CNN series in fine-grained region modeling.
However, compared with the YOLO series, R-CNN-based models generally incur higher computational costs and have slower detection speeds. Therefore, reducing computational costs while maintaining the advantage of precise region modeling has become a key direction for the further development of the R-CNN series.
2.3. Transformer-Based Recognition
The transformer network model is now widely applied in the field of natural language processing [
60,
61,
62]. Unlike CNNs, transformer models utilize attention mechanisms to discover relationships between various features, allowing them to handle relationships between any two positions in a sequence without distance constraints. At present, the transformer series of networks has also been applied in computer vision [
63,
64,
65]. Vision Transformer (ViT) is a variant of the original transformer for computer vision (CV). It converts image inputs into sequences of patches and employs the Encoder module to process these sequences and extract features [
66]. Transformer-based models require significantly higher computational resources and are more complex than CNN-based networks for object detection. They are well suited to large datasets but tend to perform moderately on smaller datasets [
67,
68].
As shown in
Figure 4, ViT consists of three main modules: Linear Projection of Flattened Patches (i.e., the Embedding layer), Transformer Encoder, and Multilayer Perceptron (MLP) Head. Image data has the format (H, W, C), where H, W, and C represent Height, Width, and Channels, respectively, whereas a traditional transformer expects a sequence of vectors as input. In the ViT model, the embedding layer therefore first divides the image into patches of 16 × 16 pixels. These patches are then flattened and mapped to one-dimensional vectors by a linear projection, which is commonly implemented as a convolution whose kernel size and stride equal the patch size. A Class Token is added to aggregate information from all patches into a representation of the overall image. Additionally, positional encodings are included so that the model can discern the position of each patch within the original image, which aids comprehension of the image’s structure.
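The patch-splitting step and the resulting sequence length can be made concrete with a short sketch. It assumes an image stored as nested lists of shape H × W × C whose height and width are multiples of the patch size; the learned projection and positional encodings are omitted, and the names `patchify` and `sequence_length` are illustrative.

```python
def patchify(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    vec.extend(image[r][c])  # append the C channel values
            patches.append(vec)
    return patches


def sequence_length(h, w, patch=16):
    """Tokens seen by the encoder: one per patch plus the class token."""
    return (h // patch) * (w // patch) + 1
```

For example, a 224 × 224 RGB image yields 14 × 14 = 196 patches of 16 × 16 × 3 = 768 values each, so the encoder receives 197 tokens once the class token is prepended.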
The original transformer model includes both an encoder and a decoder. In the ViT model, only the Encoder is utilized, and its structure is shown in
Figure 5.
There are several steps included in the encoder:
- 1.
Performing the Layer Normalization operation on the data sequence output from the Embedding layer to prevent gradient vanishing;
- 2.
Mapping the data into Query (Q), Key (K), and Value (V) vector spaces, completing multi-head self-attention analysis through similarity calculation and weighted summation, and adding the result to the original input patch sequence via a residual connection;
- 3.
Sending the processed data into the norm layer and using an MLP to perform a non-linear transformation on the output of the self-attention mechanism to enhance the model’s expressive ability.
The encoder output is fed into the MLP head to map the final ViT output to the target task’s output space and complete classification.
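The normalization, attention, and residual steps of the encoder can be sketched in plain Python as follows. This is a deliberately reduced illustration: it uses a single attention head and identity projections for Q, K, and V (a real encoder learns separate projection matrices and uses several heads), and the function names are chosen here for clarity.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def attention_sublayer(tokens):
    """Layer norm, self-attention, then a residual connection to the input."""
    normed = [layer_norm(t) for t in tokens]
    # Assumption: identity projections, so Q = K = V = normed tokens.
    attended = attention(normed, normed, normed)
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attended)]
```

Each output token is a weighted mixture of all input tokens, which is how the encoder relates any two patches regardless of their distance in the image.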
In addition to ViT networks, there is another transformer model designed to mimic convolutional networks: Swin Transformer (SwinT) [
69]. In SwinT, self-attention is computed within local windows, and a shifted-window approach is introduced to enable cross-window information exchange, allowing it to handle images of arbitrary input sizes. However, this design can diminish its ability to understand global information and increase SwinT’s structural complexity.
The Detection Transformer (DETR) network is an end-to-end transformer architecture specifically designed for object detection [
70]. It has been widely adopted in this domain, and its specific structure is shown in
Figure 6.
DETR has five modules:
- 1.
Backbone: Serving as the network’s backbone, it initially extracts image features, producing rich feature maps.
- 2.
Encoder: Building on the features extracted by the CNN, it learns global information. It adds positional encodings to retain spatial information, which enables better interaction and information transmission between features at different positions.
- 3.
Decoder: The input to it consists of two parts, the encoder output and object queries. The latter are a fixed number of learnable query vectors, typically set to 100, with each vector representing a target to be detected. During decoding, these query vectors interact with the encoder features to directly generate bounding boxes.
- 4.
Prediction heads: Composed of multiple feedforward neural networks (FFNs), the prediction heads output the final class probabilities and bounding box coordinates.
- 5.
Bipartite matching loss: The predicted boxes are compared with the ground-truth boxes, and a bipartite matching algorithm assigns each ground-truth box to a unique prediction. The loss is then computed on the matched pairs, facilitating the continued training of the model.
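The one-to-one assignment in the last step can be illustrated with a small sketch. It assumes a cost based only on the L1 distance between box coordinates and solves the assignment by exhaustive search, which is only feasible for tiny examples; DETR itself uses a richer matching cost and an efficient assignment solver. The names `l1_cost` and `match` are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes given as coordinate tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match(preds, targets):
    """Minimum-cost one-to-one assignment of targets to predictions.

    Requires len(preds) >= len(targets). Returns (pairs, cost), where
    pairs holds (prediction_index, target_index) tuples.
    """
    best_cost, best_pairs = float("inf"), []
    # Try every way of giving each target its own distinct prediction.
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(perm)]
    return best_pairs, best_cost
```

Predictions left unmatched are trained toward a "no object" class, which is why the fixed set of object queries can exceed the number of defects in an image without requiring NMS.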
Compared with the YOLO series, which is also popular in object detection, DETR handles bounding box prediction in a simpler way. In the classic YOLO series, the input image is first divided into a fixed grid, and multiple candidate boxes are generated within each grid cell to accommodate objects of different sizes and aspect ratios. In contrast, DETR utilizes a transformer architecture to directly output the coordinates of bounding boxes, resulting in a more streamlined overall process that does not require hand-crafted priors such as anchor boxes.
Compared to YOLO and R-CNN series, transformer-based networks can analyze contextual information in images, enabling the generation of richer feature representations and often achieving higher accuracy. Many researchers have further optimized transformer-based models for metal surface defect detection, as summarized in
Table 3. Wei et al. proposed a detection model based on ViT to address the issues of varying defect sizes and class imbalance in metal defect detection. The model employs convolution with receptive field attention and a context broadcasting median module that optimizes the distribution density of attention maps through median pooling, thereby improving defect detection accuracy under complex backgrounds [
71]. Jeong et al. combined the ResNet-50 network with the ViT network, utilizing ResNet-50 for feature extraction at both high and low levels of the network while employing ViT to organize contextual information within the model [
72].
In the field of SwinT networks, Zhu et al. proposed a detection model, LSwin Transformer. It combines dilated and standard convolutions to enlarge the receptive field and adopts a novel cross-window strategy for computing attention [
73]. Liu et al. integrated a dual-branch network with SwinT, utilizing spatial and channel attention mechanisms, as well as bilinear interpolation to enhance edge detection. This approach improves the extraction of global semantic information for steel strip surface defect detection [
74]. Additionally, Huang et al. incorporated SwinT into the neck module of YOLO to enhance the model’s capability for understanding deep semantic features. They also embedded the Squeeze-and-Excitation (SE) module into YOLO’s CSP module to strengthen channel information and further refined the FPN structure to better fuse multi-scale information [
75].
In the field of DETR, Xia et al. designed a foreground supervision module in DETR to progressively extract foreground features based on feature scores. They proposed a cascaded hybrid matching strategy to increase the number of positive samples and avoid Non-Maximum Suppression post-processing [
76]. Mao et al. combined the lightweight MobileNetV3 with a multi-path feature fusion module based on the Real-Time Detection Transformer, significantly reducing the model’s parameter size and computational complexity. Moreover, they proposed an improved MPDIoU loss function to improve bounding box regression accuracy [
77]. In the aforementioned improvements based on transformer models, Wei et al. [
71] improved the model’s attention mechanism, thereby enhancing its ability to understand features; Jeong et al. [
72] introduced a residual network, and Zhu et al. [
73] modified the cross-window strategy, strengthening the model’s feature extraction capability; Xia et al. [
76] and Mao et al. [
77] optimized the network’s post-processing. These transformer-based methods have achieved considerable success on metal surface defect detection datasets.
Table 3 presents the information on the improved transformer-based models. It can be observed that most improvements focus on optimizing the attention mechanism and improving information extraction capability. Related studies enhance attention modules, refine window partition strategies, and improve feature extraction mechanisms to strengthen the model’s ability to capture complex defects. These efforts highlight the potential value of transformer models in attention-based modeling.
However, transformer-based models face problems such as high computational costs and slow detection speeds. Although some lightweight designs have improved inference efficiency, there is still room to improve their accuracy [
77]. Overall, future research on transformer-based models should balance modeling capacity and computational efficiency to meet the requirements of industrial deployment.
3. Image Processing Techniques
In the field of metal surface defect detection, the detection targets in an image vary in physical size. To help the network better handle targets of different sizes, Pyramid Networks (PNs) have been widely used for image feature processing [
78,
79,
80,
81,
82].
Traditional image pyramids usually process the image independently at each scale. The extracted features are relatively basic, and repeatedly computing features at different scales leads to high computational overhead [
83]. The Feature Pyramid Network (FPN) aims to effectively utilize multi-scale features by combining high-level and low-level features through lateral connections to form a richer feature representation. Its structure is shown in
Figure 7.
FPN consists of two parts: (a) a bottom-up pathway from shallow layers to deep layers; and (b) a top-down pathway from deep layers to shallow layers. These two parts are linked by lateral connections. In FPN, feature extraction is first performed by the backbone network; after multiple convolutional layers, the feature map size is gradually reduced while semantic information is enriched. These features pass through a 1 × 1 convolution that adjusts their number of channels, followed by feature fusion in the top-down pathway. There, the features of the uppermost layer are up-sampled (e.g., by interpolation) to match the size of the next lower-level feature map and are then summed or concatenated with it to form a new feature map, which contains the semantic information of the higher layers while retaining the details of the lower layers.
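The top-down fusion can be sketched with single-channel feature maps stored as nested lists. The example assumes that the lateral maps have already been brought to a common channel count, that each level is exactly half the resolution of the one below it, and that fusion uses nearest-neighbour up-sampling followed by element-wise addition; `upsample2x` and `top_down_merge` are illustrative names.

```python
def upsample2x(fmap):
    """Nearest-neighbour up-sampling of a 2-D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def top_down_merge(laterals):
    """Fuse lateral maps ordered from shallow (largest) to deep (smallest)."""
    merged = [laterals[-1]]  # the deepest level is used as-is
    for lateral in reversed(laterals[:-1]):
        up = upsample2x(merged[0])
        # Element-wise sum of the lateral map and the up-sampled deeper map.
        fused = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        merged.insert(0, fused)
    return merged
```

Each fused level therefore carries both its own fine spatial detail and the semantic content propagated down from deeper levels, which is what benefits small-defect detection.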
FPN can be further integrated with other popular network architectures, such as FPN with YOLO or FPN with transformers. In fact, this has already become a current trend. FPN can be optimized according to different specific scenarios. Currently, many scholars have further improved it in the task of metal surface defect detection. Some improved models of FPN-based networks are summarized in
Table 4. Han et al. proposed a pyramid structure for small object detection, SA-FPN. By introducing a multi-scale fusion module with perceptual ability, they guided the model to solve the problem of information loss for small surface defects during the model filtering process [
84]. To enhance the model’s ability to handle various metal surface defects of different sizes, Du et al. designed an adaptive focusing PN (Foc-FPN) [
85]. They input multi-level features into the model and made it focus on the spatial information connections between defects. Zhou et al. proposed a CM-FPN based on the attention mechanism [
86]. In this structure, both global attention and channel attention were considered simultaneously, enabling the model to focus on the parts rich in important information among different layers. With the model’s speed in mind, Zhu et al. introduced a feature fusion block into the pyramid structure to extract information from different layers and enable interactions between them [
87]. In the above improvements to the pyramid model, the mutual connections between different layers were all emphasized. This is also an advantage of the pyramid model structure, which can better integrate the model’s information and thus achieve information interactions.
Table 4 presents the information on the improved FPN modules. It can be observed that most improvements based on FPN focus on multi-scale feature fusion. The related studies listed in
Table 4 optimize the pyramid information structure, enhance feature attention, and integrate multi-scale feature representation mechanisms to improve defect recognition performance.
This trend indicates that multi-scale feature representation based on FPN has clear advantages in metal surface defect detection. It is particularly effective in scenarios with significant scale variation. However, its use tends to increase model complexity. Therefore, the resulting changes in inference efficiency should also be carefully considered.
4. Metal Surface Defect Dataset
In the field of metal surface defect detection, several publicly available datasets can be used for research and applications.
The NEU-DET Surface Defect Database, created by Northeastern University [
88], contains surface defects of steel strips, covering six different types of defects: rolled-in scale, patches, crazing, pitted surface, inclusion, and scratches. Each defect type has 300 samples, totaling 1800 images in this dataset. Sample surface defects from this dataset are shown in
Figure 8. However, the intra-class defects exhibit significant variation, while the inter-class defects are overly similar, posing challenges for their accurate classification.
The GC10-DET Dataset focuses on surface defects of steel plates [
89], featuring ten main types of defects, including punch holes, weld seams, and crescent-shaped gaps, all collected from real-world scenarios. The schematic diagram of the surface defects is shown in
Figure 9. After reorganization, the dataset contains 2294 grayscale images. It was compiled by Lv et al. from Tianjin University.
The Tianchi Aluminum Surface Defect Dataset is an aluminum surface defect dataset released during the Tianchi Data Competition [
90]. It is also sourced from real industrial scenarios, utilizing images from leading companies in the aluminum profile sector. The competition provided a total of ten thousand images of defective products, including defects such as paint blisters, dirt spots, and scratches, providing important references for aluminum processing and quality control. Some samples of the surface defects are shown in
Figure 10. The dataset exhibits an imbalanced data distribution, and there is considerable variation in defect specifications across categories, which adds difficulty to the classification task.
The X-SDD dataset is a surface defect dataset for hot-rolled steel strips, which contains a total of 1360 defective images [
91]. In the X-SDD dataset, the resolution of each defective image is 128 × 128 pixels, and it includes seven different types of surface defects, such as red scale, scale ash, surface scratches, etc. Some samples of the surface defects are shown in
Figure 11. Compared with the classic NEU-DET defect dataset, X-SDD provides more types of defects and is suitable for application in the field of industrial defect detection.
The RSDDs dataset covers surface defects of railway tracks. It was annotated by professional inspectors, providing a reliable foundation for detection [
92]. The surface defects in the RSDDs dataset are divided into Type I and Type II, corresponding to two different sources of railway track data: the Type I RSDDs dataset was collected from express tracks and contains 67 images; the Type II RSDDs dataset was collected from ordinary or heavy-duty transportation tracks and contains 128 images. Examples of the surface defects are shown in
Figure 12.
Table 5 lists the detailed information on the above-mentioned datasets, including data source, defect type, image size, number of images, number of classes, annotation type, imbalance ratio, link, and related research. This information should help readers better understand these datasets and use them effectively.
Table 6 presents the typical mAP ranges for the YOLO, R-CNN, and transformer series methods on three mainstream industrial defect detection datasets: NEU-DET, GC10-DET, and RSDDs. All performance values in the table represent approximate typical ranges reported for each type of method on the corresponding dataset. Due to differences in network backbones, input image sizes, and training strategies used in different studies, the results across different method categories and within the same category are not strictly comparable. They are provided only to reflect the overall performance level of each type of method.
The five metal surface defect datasets recommended in this paper are all based on industrial scenarios. Compared with academic datasets, industrial datasets often show a clear long-tail distribution. Some defect categories contain only a small number of samples, and multiple defect types may appear in a single image. In contrast, academic datasets usually have relatively balanced categories and sufficient samples. In terms of image acquisition, academic datasets are often collected under controlled lighting and shooting conditions, with limited interference. In real industrial environments, data collection is often affected by reflection, noise, and complex texture interference. As a result, industrial datasets present greater diversity in image content and quality.
At the annotation level, labeling in industrial datasets is often influenced by subjective judgment. Academic datasets are usually reviewed through standardized manual verification procedures. This makes defect detection on industrial datasets more complex and increases the difficulty of a detection task. Differences in statistical distributions and data-acquisition mechanisms create a gap between academic and industrial datasets.
For small defect detection, GC10-DET and the Tianchi Aluminum Surface Defect Dataset are recommended industrial datasets. Compared with other datasets, these two datasets contain high-resolution images collected from real industrial environments. The defect scales cover a broader range, and some defects occupy only a very small proportion of the entire image. In GC10-DET, defect categories include small targets such as a rolled pit. Compared with NEU-DET, GC10-DET is more challenging in terms of category diversity and variations in the defect scale. It can be used to effectively evaluate a model’s ability to detect tiny defects. The Tianchi Aluminum Surface Defect Dataset also contains many small defects, such as stains and dents. The background is complex and includes various interference factors. It is closer to real industrial inspection scenarios and can be used to effectively assess the practical performance of a model in complex real-world environments.
5. Evaluation Metrics
In the field of metal surface defect detection, many metrics are available to evaluate model performance, and the appropriate choice depends on the specific detection scenario. Common examples include Precision [
103], Recall [
104], Average Precision [
105], Mean Average Precision [
106], and Frames Per Second [
107].
Precision (P) is a key metric for assessing the performance of classification models. It assesses the model’s accuracy in predicting positive classes. Precision is defined as the ratio of the number of true positive samples correctly predicted as positive to the total number of samples predicted as positive, i.e.,
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
where $\mathrm{TP}$ represents the number of true positive samples correctly predicted as positive, and $\mathrm{FP}$ represents the number of false positive samples incorrectly predicted as positive. Precision effectively reflects the accuracy of the model in predicting the correct samples. It is particularly important in applications where reducing false positives is critical. However, precision does not take into account the false negatives. Therefore, focusing solely on precision in imbalanced datasets may lead to a misjudgment of model performance.
Recall (R) is primarily used to measure the model’s ability to identify positive samples. It is defined as the ratio of the number of true positive samples correctly identified by the model to the total number of actual positive samples, which includes TP and false negatives, i.e.,
$$R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
where $\mathrm{TP}$ represents the number of true positive samples correctly identified as positive, and $\mathrm{FN}$ represents the number of false negative samples that are not identified as positive. The advantage of recall is that it emphasizes the model’s comprehensiveness in identifying positive samples, making it suitable for scenarios where the cost of missing detections is high. However, the drawback is that it may ignore the false positive cases, which leads to the need for combining precision and recall to comprehensively evaluate model performance in practical applications.
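The two definitions can be illustrated with a minimal sketch. The function names and the counts are hypothetical and chosen only for illustration; returning 0.0 for an empty denominator is an assumed convention, not one stated in this paper.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 80 defects found, 20 false alarms, 40 defects missed.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
print(f"precision={p:.3f}, recall={r:.3f}")
```

With these counts the detector looks accurate when judged by precision alone, while recall reveals that a third of the defects were missed, which is why the two metrics are read together.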
Average precision (AP) is a fundamental and well-defined metric for assessing the accuracy of metal surface defect detection. It serves as an intuitive measure of a model’s ability to detect a specific category of objects, where higher AP values indicate better detection performance for that category. Specifically, AP is defined as the area under the precision–recall curve and is typically computed using standard numerical integration or approximation methods. The calculation process is as follows:
$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR$$
where $P(R)$ denotes the precision at recall level $R$.
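One common numerical approximation of this area is all-point interpolation, sketched below. It is only one of several conventions in use. The sketch assumes that each detection has already been matched against the ground truth (for example, by an IoU threshold) and labeled as a true or false positive; the function name and input layout are hypothetical.

```python
def average_precision(detections, num_gt):
    """Approximate the area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth defects of this class.
    """
    if num_gt == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical ranked output: two correct detections and one false alarm.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```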
The Mean Average Precision (mAP) is a commonly used evaluation metric in object detection tasks, designed to quantify the precision and recall of a model in detecting objects. mAP is calculated as the arithmetic mean of the per-class AP values. For $C$-class detection tasks, we define:
$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{AP}_i$$
where $\mathrm{AP}_i$ represents the average precision of the $i$-th class.
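As a small worked example of this averaging step, the sketch below takes a mapping from class name to AP. The class names and AP values are hypothetical and do not come from any table in this paper.

```python
def mean_average_precision(ap_per_class):
    """Arithmetic mean of per-class AP values (mAP over C classes)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values for a three-class task.
aps = {"crack": 0.5, "scratch": 0.75, "patch": 1.0}
print(f"mAP = {mean_average_precision(aps):.3f}")
```

Because every class carries equal weight, a rare defect class with low AP lowers mAP as much as a frequent one, which matters for the imbalanced datasets discussed earlier.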
In object detection, mAP comprehensively reflects a model’s overall performance in object localization and recognition, providing a critical reference for model selection and optimization.
Frames Per Second (FPS) is a metric used to evaluate the real-time performance of a detection model. It represents the number of image frames that the model can process per unit time. A higher FPS indicates faster inference speed and stronger real-time capability. FPS is typically defined as the number of images processed within a given time period, i.e.,
$$\mathrm{FPS} = \frac{N}{T}$$
where $N$ denotes the total number of processed images, and $T$ denotes the total time taken to complete the inference of these images.
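A minimal timing sketch of this definition follows. The `infer` callable stands in for a real detector and is purely hypothetical. The warm-up pass is an assumed but common practice that keeps one-time start-up costs out of the measurement.

```python
import time


def measure_fps(infer, images, warmup=2):
    """Return processed images per second for the callable `infer`."""
    # Warm-up runs are excluded from timing.
    for image in images[:warmup]:
        infer(image)
    start = time.perf_counter()
    for image in images:
        infer(image)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(images) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(image):
    # Stand-in for a detector: any function that consumes one image.
    return sum(image)


fps = measure_fps(dummy_infer, [[1, 2, 3]] * 1000)
print(f"FPS = {fps:.1f}")
```

Reported FPS values depend on hardware, batch size, and input resolution, so figures from different studies are comparable only when these settings match.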
In
Table 7, we present a statistical analysis of six metal surface defect detection networks on NEU-DET based on different backbone models. The reported metrics include mAP, number of parameters, and FPS. We also provide the input image size used during training, the dataset split, and whether data augmentation was applied. The evaluated models include the Improved Dilated Neighborhood Attention Transformer (DINAT), a Fast R-CNN model with Swin Transformer by Li et al. [
108], and Faster R-CNN, YOLOv5s, and Improved YOLO models designed and validated by Yang et al. [
109]. Due to variations in training settings across different studies, the reported results are intended to reflect overall performance ranges.
Among these models, Improved DINAT achieves the highest mAP of 83.7%, which reflects the strong recognition capability of transformer-based models. The Improved YOLO model achieves the highest FPS of 70.2, maintaining the high detection speed of the YOLO series. In terms of parameters, both YOLOv5s and the Improved YOLO model have about 7 million parameters. This is significantly fewer than the transformer-based and R-CNN-based models, which have more than 60M parameters [
108].
6. Comparative Critical Analysis of Detection Paradigms
6.1. Performance and Trade-Off Comparison of YOLO, R-CNN, and DETR
In the field of metal surface defect detection, the engineering application of detection models is often affected by their detection accuracy, real-time performance, deployment requirements, and other related factors. YOLO, R-CNN, and the transformer-based DETR represent three widely used detection paradigms. Due to their different architectural design principles, they exhibit clear differences in detecting defects that are small in scale, irregular in shape, or set against complex textured backgrounds.
The existing studies mainly focus on model structures and performance reports. They seldom explore the intrinsic relationship between paradigm characteristics and defect properties. This section provides a structured analysis of the three paradigms from the perspectives of performance range, task adaptability, and other relevant aspects. It further examines the coupling relationship between these detection paradigms and metal surface defect detection. The aim is to clarify the applicability boundaries and optimization directions of different paradigms.
Table 8 presents a detailed performance comparison of different detection paradigms. From the perspective of mAP performance, the DETR series achieves the highest detection accuracy among the three paradigms. This advantage is attributed to its global attention mechanism, which helps capture weak features and distinguish defects under complex backgrounds. In terms of model size, the YOLO series models are relatively small. This property is beneficial for fast training and convenient deployment. In contrast, the R-CNN and DETR series adopt more complex architectures and contain more parameters. Regarding inference speed, the DETR series operates at a moderate FPS due to the computational cost of its attention mechanism. The R-CNN series is affected by its two-stage design and thus has lower FPS than YOLO models.
From the perspective of sensitivity to small defects, earlier YOLO models showed moderate performance on small object detection [
110]. They relied on multi-scale feature fusion to improve the perception of small targets. In recent versions, the YOLO series has achieved clear improvements in this aspect. The R-CNN series enhances local feature representation through region cropping. This strategy is beneficial for small defect localization to some extent. However, its recall performance depends on the quality of region proposal generation. The Transformer-based series performs global modeling through self-attention. It demonstrates a strong representation capability under weak features and complex backgrounds, and it can detect small defects through a global attention mechanism.
In terms of deployment costs, the YOLO series adopts a relatively lightweight architecture and provides fast inference speed. It is more suitable for industrial online inspection and deployment on edge devices. The R-CNN series introduces additional computational overhead due to its two-stage design, which leads to a higher deployment cost. The transformer-based series has medium to high computational complexity and requires considerable computing power and memory resources.
In terms of core modeling strategies, YOLO combines end-to-end dense prediction with convolution-based local modeling. The R-CNN series is built on a two-stage framework, which includes region proposal generation and local feature extraction. The transformer-based series performs overall modeling through anchor-free prediction and a global attention mechanism. The structural differences among these three paradigms have largely determined their applicability boundaries and engineering feasibility in metal surface defect detection.
6.2. Coupling Analysis of Detection Paradigms and Defect Features
In metal surface defect detection tasks, defects are often small in scale and irregular in shape. The detection process also faces complex textured backgrounds. In addition, metal surface defects are usually rare. As a result, available defect samples are often limited in number. For small datasets, different detection paradigms exhibit distinct representation capabilities across defect types.
In metal surface defect detection, small defects are common. These small-scale defects occupy limited regions in images and contain weak feature information. During feature extraction and sampling, their representations are easily suppressed, which makes them more difficult to detect. In recent versions, the YOLO series introduces a feature pyramid structure to improve small-object detection. Through this structure, shallow high-resolution features participate in prediction, which enhances the perception of small targets. The Feature Pyramid Network (FPN) plays an important role in multi-scale feature fusion. Through top-down information flow and cross-layer feature integration, FPN enriches high-resolution feature maps with deep semantic information. This process improves the representation ability for small objects and alleviates the limitation of YOLO in detecting small defects. The R-CNN series applies a region proposal mechanism and performs region cropping on candidate areas. This strategy increases the proportion of small defects within local regions and strengthens localization performance. However, the region proposal process depends on feature map resolution and scale design. When the defect size approaches the down-sampling limit, the recall rate tends to decrease. The DETR series adopts an anchor-free prediction strategy and introduces multi-scale attention mechanisms. It dynamically aggregates features at different scales. This design eases the scale constraints caused by anchor boxes and shows certain advantages in recognizing small-scale defects.
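The top-down information flow of FPN described above can be sketched in a few lines. This is a deliberately simplified illustration: it assumes single-channel feature maps stored as nested lists, each level half the size of the previous one, and it omits the 1×1 lateral and 3×3 smoothing convolutions used in a real FPN. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (nested lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fpn_top_down(levels):
    """levels: backbone maps ordered fine -> coarse. Returns fused maps in the same order."""
    pyramid = [levels[-1]]  # the coarsest level is passed through unchanged
    for lateral in reversed(levels[:-1]):
        # Fuse the upsampled coarser map with the finer lateral map.
        pyramid.insert(0, add_maps(lateral, upsample2x(pyramid[0])))
    return pyramid


# Toy maps: a weak local response at fine resolution, strong semantics at coarse resolution.
c3 = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
c4 = [[0, 0], [0, 0]]
c5 = [[10]]
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3[0])
```

In the fused fine-resolution map, the small local response is preserved while the coarse semantic signal is added everywhere, which is the mechanism that helps small defects survive down-sampling.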
In metal surface defect detection tasks, defects such as cracks and dents often show significant shape variations and irregular boundaries. Both anchor-based YOLO and R-CNN series rely on rectangular bounding boxes. This constraint may lead to fitting bias for complex shapes. In scenarios with slender or curved cracks, localization errors may occur. Although FPN improves multi-scale feature representation, it does not remove the geometric limitation of anchor boxes. Therefore, the adaptability to highly irregular targets is still restricted by the structural design. The DETR series adopts an anchor-free prediction strategy and establishes global feature relationships through self-attention. This mechanism allows the model to capture object shape characteristics over a wider spatial range. It shows structural advantages when handling defects with large shape variations.
Industrial metal surfaces often contain obvious texture patterns and background noise. In some cases, the difference between defect features and the background is small. This situation may lead to missed detection and false positives. YOLO and R-CNN series rely on convolution to extract local features. Their receptive fields expand layer by layer. Given strong textured backgrounds, their ability to distinguish defects from the background is limited. FPN enhances feature representation by fusing semantic information from different levels. This design improves robustness to interference. However, it still follows a local convolution paradigm. The DETR series establishes cross-region feature modeling through a global attention mechanism. It strengthens defect-related features and suppresses background interference at a global scale. In complex texture or low-contrast scenarios, its feature representation is relatively stable.
When constructing datasets for metal surface defect detection, high data acquisition costs and annotation difficulty are common challenges. As a result, the available dataset size is often limited. The convolution-based R-CNN and YOLO models have a strong structural inductive bias. Through shared convolutional kernels, the models can learn stable feature representations. This property is beneficial under small-data conditions and can help the training process converge more reliably. In addition, FPN strengthens multi-scale features through cross-layer fusion. When information is limited, low-level features can provide useful spatial details. In contrast, the Transformer-based DETR series relies on self-attention for comprehensive modeling. The self-attention mechanism requires large-scale data to learn effective parameter representations. It establishes spatial and feature relationships mainly through data-driven learning, rather than structural constraints. When the dataset size is limited, this characteristic may lead to overfitting or slow convergence.
7. Future Work
The detection of metal surface defects is an indispensable aspect of the manufacturing industry. With the continuous development of technology, there are now many solutions for the detection of metal surface defects. However, when dealing with some relatively complex surface defects, there is still a great deal of room for improvement in the classification performance of existing models. Beyond tracking purely technical evolution, the field currently requires analysis that starts from the fundamental nature of the problem. It is necessary to conduct a systematic examination of the key challenges and to develop future research directions around these core issues.
Based on existing studies and industrial practice, metal surface defect detection faces several key challenges. These include (1) multi-scale variation and the difficulty of small object detection, (2) the trade-off between detection accuracy and real-time performance, (3) data scarcity and class imbalance, (4) model complexity and computational efficiency, (5) model interpretability, and (6) domain shift. These issues represent important research topics in the field of metal surface defect detection. A systematic discussion of these challenges is necessary to promote the continuous development of related technologies.
7.1. Multi-Scale Variation and Difficulty of Small Object Detection
Compared with targets in other detection tasks, metal surface defects often show significant variations in size. This characteristic challenges the ability of models to detect small objects and handle multi-scale defects. Studies applying YOLO to industrial scenarios focus on multi-scale feature fusion [
111,
112]. The results indicate that, on datasets with large size differences and small objects, YOLO has room for improvement in multi-scale modeling. To address this issue, deformable convolution can be introduced to adaptively extract features at different scales, thereby enhancing a YOLO model’s adaptability to scale variation.
In addition, to handle significant size variation and small object detection, most studies adopt the FPN structure to further improve the three mainstream models, i.e., YOLO, R-CNN and transformer. FPN integrates multi-scale information at each level and fuses features of different resolutions, including small-scale features. This design enables the network to process defects of various sizes more effectively. Future research can further explore improvements and applications of the FPN structure in different detection models.
7.2. Trade-Off Between Detection Accuracy and Real-Time Performance
Since most metal surface defect detection tasks arise from industrial environments, the trade-off between detection accuracy and inference speed becomes a key issue. A YOLO model adopts a one-stage detection framework and formulates prediction as a regression problem. This design improves inference speed, but its accuracy is often lower than that of two-stage methods. Therefore, further research can focus on improving the accuracy of the YOLO series. For example, effective attention mechanisms can be introduced into it, and hybrid architectures that combine YOLO with other frameworks can be designed.
In contrast, the R-CNN series generally achieves high accuracy, especially in complex background conditions. However, the complex processing of region proposals increases model complexity and reduces inference speed. Its real-time performance is weaker than that of YOLO. In addition, R-CNN heavily depends on the quality of region proposals. Future research needs to focus on improving its inference speed while maintaining detection accuracy. Possible directions include refining the region proposal module or constructing hybrid structures that integrate one-stage and two-stage designs to achieve a better balance between accuracy and speed.
The transformer-based series performs global modeling through self-attention and shows strong accuracy potential for defect detection subject to complex backgrounds. However, its computational complexity is high, and the inference cost on high-resolution industrial images is considerable. This limitation restricts its real-time deployment. Therefore, reducing its computational cost and developing its lightweight version through, e.g., knowledge distillation [
113,
114,
115], are key directions for balancing accuracy and real-time performance in transformer-based models.
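As one illustration of knowledge distillation, the sketch below computes a temperature-softened KL divergence between the class distributions of a teacher and a student. It assumes the classic logit-based formulation for classification outputs; distillation schemes for detectors often add feature-level or box-level terms that are not shown here. The function names and logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


# Hypothetical class logits for one region from a large teacher and a small student.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(f"distillation loss = {distillation_loss(student, teacher):.4f}")
```

A higher temperature flattens both distributions so that the student also learns from the teacher's relative confidence in the non-predicted classes. In practice this term is usually combined with the ordinary supervised loss.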
7.3. Data Scarcity and Class Imbalance
In practical metal manufacturing environments, the number of defect samples is much smaller than that of normal samples. There is also a clear imbalance among different defect categories. This defect data scarcity limits the generalization ability of deep learning models and increases the risk of overfitting.
To address data scarcity and class imbalance in metal surface defect detection, future research can explore self-supervised learning and transfer learning strategies [
116,
117]. By pretraining on a large number of unlabeled industrial images, models can learn more general feature representations and reduce the dependence on manually annotated data. In addition, data augmentation methods based on generative models can synthesize diverse defect samples [
118,
119]. This approach helps alleviate class imbalance and improves the recognition ability for minority defect categories [
120,
121].
When defect samples are extremely limited, unsupervised anomaly detection methods are of great importance [
122,
123]. These methods do not require extensive defect annotations. They first learn the feature distribution of normal samples. Regions that significantly deviate from the learned normal pattern are then identified as potential defects. In this way, the reliance on labeled defect data can be effectively mitigated. Future research can further improve anomaly decision mechanisms, for example, by refining feature difference measurement or enhancing feature representation capability, in order to improve detection stability and practical performance.
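The idea of learning the normal feature distribution and flagging deviations can be sketched as follows. The sketch assumes that each image region has already been reduced to a small feature vector, and it uses a simple per-dimension z-score as the difference measure. Practical methods typically use deep features and richer distance measures. The function names, feature values, and threshold are hypothetical.

```python
import statistics


def fit_normal_model(normal_features):
    """Estimate per-dimension mean and standard deviation from defect-free samples."""
    dims = list(zip(*normal_features))
    means = [statistics.fmean(d) for d in dims]
    # Fall back to a tiny value so constant dimensions do not divide by zero.
    stds = [statistics.pstdev(d) or 1e-8 for d in dims]
    return means, stds


def anomaly_score(feature, model):
    """Largest absolute z-score across feature dimensions."""
    means, stds = model
    return max(abs(x - m) / s for x, m, s in zip(feature, means, stds))


def is_defect(feature, model, threshold=3.0):
    """Flag a sample whose score exceeds the (assumed) threshold."""
    return anomaly_score(feature, model) > threshold


# Hypothetical 2-D features (e.g., mean intensity and edge density) of normal regions.
normal = [[0.9, 0.1], [1.0, 0.2], [1.1, 0.1], [1.0, 0.0]]
model = fit_normal_model(normal)
print(is_defect([1.0, 0.1], model), is_defect([1.05, 0.9], model))
```

No defect labels are used at any point; only the threshold needs tuning, which is where the refinement of anomaly decision mechanisms mentioned above would apply.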
7.4. Model Complexity and Computational Efficiency
Among YOLO, R-CNN, and transformer-based models, the transformer architecture processes all input tokens through global attention. Because every token attends to the complete input during computation, model complexity is relatively high. Therefore, improving the efficiency of transformer-based models has become an important research issue.
In recent years, some studies have attempted to combine CNN and transformer architectures [
124,
125]. CNN models are effective in extracting local features and require lower computational cost. By integrating CNN and transformer techniques, it is possible to reduce the excessive complexity caused by pure transformer models.
The Mamba model is a novel state space model. Compared with the quadratic computational complexity of the transformer series, Mamba has linear computational complexity. It achieves efficient feature representation through selective information modeling [
126]. In the field of computer vision, Zhu et al. proposed a Mamba model for image processing, namely Vision Mamba (ViM) [
127]. In the ViM model, the image is first divided into patches and projected into tokens, which are then fed into the ViM encoder. Its bidirectional module is used to enhance its ability to understand the context of an image. Future research should consider applying the Mamba model to metal surface defect detection or exploring hybrid architectures that combine Mamba with other models.
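The difference between quadratic and linear growth can be made concrete with a small counting sketch. It only counts token interactions as a rough proxy for cost and ignores hidden dimensions, constants, and hardware effects. The image sizes and the patch size of 16 are illustrative assumptions rather than settings taken from the cited works.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def attention_pairs(n):
    """Global self-attention relates every token to every token: n^2 pairs."""
    return n * n


def scan_steps(n, directions=2):
    """A linear-time sequence model visits each token once per scan direction."""
    return directions * n


for size in (224, 1024):
    n = num_tokens(size, size, patch=16)
    print(size, n, attention_pairs(n), scan_steps(n))
```

Moving from a 224 × 224 input to a 1024 × 1024 input multiplies the token count by roughly 21, the linear scan count by the same factor, and the attention pair count by roughly 437, which illustrates why high-resolution industrial images are costly for global attention.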
7.5. Model Interpretability
In the field of metal surface defect detection, the application of deep learning has improved detection accuracy and efficiency. However, it also introduces challenges related to model interpretability. Deep learning models are often regarded as black-box systems [
128,
129]. Although they can provide detection results, their internal decision process is difficult to understand directly. In metal surface defect detection, a model may identify a region as defective but cannot clearly explain the specific basis of its judgment. This uncertainty may introduce risks in industrial applications. Misjudgment can affect a production process. Therefore, in practical deployment, it is necessary not only to focus on detection accuracy but also to enhance the interpretability of model decisions to achieve a clearer and explainable detection process.
To improve interpretability, attention visualization and feature response analysis can be applied during detection. These methods present the regions emphasized by the model in a visual form. For example, heat maps can be generated to observe whether the model mainly focuses on defect regions or is influenced by background interference. Such visualization helps engineers evaluate whether the model’s decisions are reasonable. It can also support the further optimization of training strategies and improve model stability. Future research may explore simpler inspection tools to increase the transparency and reliability of deep learning models.
7.6. Domain Shift and Cross-Scenario Generalization
During the collection of metal surface defects, variations in production lines, lighting conditions, and material types may cause changes in the distribution of defect images [
115]. When a model is trained on one dataset and applied to a different environment, its detection performance may decline significantly. At present, many studies focus on validation using only a limited number of datasets. Some datasets, such as the Tianchi aluminum surface defect dataset and X-SDD, are used less frequently. This concentration on specific datasets may lead to over-optimization within a limited data distribution and a lack of cross-dataset and multi-scenario validation. When such models are deployed in new metal surface defect detection tasks, their performance may be affected.
Future research should strengthen cross-domain generalization from several aspects. First, model evaluation should not rely on a single dataset. Comprehensive validation across multiple datasets is necessary. Second, domain adaptation methods can be explored to reduce distribution differences among datasets and improve model performance in unseen environments [
130]. In addition, datasets that include diverse types of metal surface defects need to be constructed and used in model training so as to improve model adaptability. Multi-scenario training and the validation of deep learning models should be conducted to enhance their cross-domain robustness.
8. Conclusions
Accurate and fast metal surface defect detection is required to ensure the quality and structural safety of numerous industrial products. It is also a key technology for promoting intelligent manufacturing and automated quality control. With the rapid development of industrial automation, traditional manual inspection and conventional machine learning methods can no longer meet the requirements for high accuracy and real-time performance in complex scenarios. In recent years, the introduction of deep learning into this area has greatly advanced metal surface defect detection technology.
This review has systematically analyzed the current applications of mainstream deep learning frameworks to metal surface defect detection. These frameworks include CNN-based models, such as the YOLO series and R-CNN, as well as detection models based on the transformer architecture. The YOLO models, which follow a one-stage detection strategy, show clear advantages in real-time detection and industrial deployment. They can achieve efficient inference in resource-limited environments. The R-CNN series, which adopts a two-stage strategy, provides high localization accuracy under complex background conditions. Transformer-based models enhance feature representation by modeling global dependencies. They show strong potential in coping with complex defect scenarios. In addition, the Feature Pyramid Network, which enables multi-scale feature fusion, plays a key role in handling defects of different sizes. It is an important structural component for improving the detection performance of models. This paper has also introduced several related metal surface defect datasets. These datasets provide important benchmarks for model validation and performance comparison.
Although deep learning methods have achieved continuous improvements in accuracy for metal surface defect detection in recent years, several challenges remain open when they are put into practical industrial use. Metal surface defects often vary greatly in size and have irregular shapes. This makes multi-scale modeling and small defect recognition difficult. The trade-off between accuracy and real-time performance among different models, as well as the conflict between model complexity and computational efficiency, limits their practical deployment. In addition, data scarcity, class imbalance, and domain shift issues place higher demands on model generalization. Limited model interpretability also affects model transparency and understanding by engineers during detection. Future research should focus on efficient multi-scale modeling, lightweight architecture design, and improved cross-domain robustness of developed models. It is important to enhance interpretability and engineering adaptability while maintaining a fine balance between accuracy and efficiency.
In general, metal surface defect detection is shifting from optimizing the performance of a single model to coordinated optimization in multiple aspects. Through innovations in algorithm design, improvements in multi-scale fusion techniques, and optimization of deployment strategies, it is possible to achieve a balance among detection accuracy, real-time performance, and generalization ability. This progress should provide strong support for reliable quality control in intelligent manufacturing and 3D printing environments [
131,
132,
133].