Applications of Deep Learning to Metal Surface Defect Detection: Status and Challenges

Wang, Yizhe; Zhou, Mengchu; Zhang, Chenyang; Sedraoui, Khaled

doi:10.3390/pr14081305

Open Access Review

by Yizhe Wang ¹, Mengchu Zhou ¹,²,*, Chenyang Zhang ¹ and Khaled Sedraoui ³

¹ Faculty of Innovation Engineering, Macau University of Science and Technology, Macau, China
² Helen and John C. Hartmann Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
³ The Department of Electrical and Computer Engineering, Faculty of Engineering, and also with K. A. CARE Energy Research and Innovation Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.

Processes 2026, 14(8), 1305; https://doi.org/10.3390/pr14081305

Submission received: 13 February 2026 / Revised: 24 March 2026 / Accepted: 5 April 2026 / Published: 19 April 2026

(This article belongs to the Topic Intelligent Maintenance and Health Management in Smart Manufacturing)

Abstract

Metal surface defect detection plays a crucial role in improving product quality and production efficiency in manufacturing and 3D printing factories. Metal defects vary widely in scale and are often irregular in shape, which limits the adaptability of general object detection models in industrial scenarios. Deep learning-based methods are widely used for metal surface defect detection due to their strong adaptability and high degree of automation. Yet existing studies pay limited attention to the adaptability and evaluation of different detection methods for metal surface defects, and to recommendations for choosing among them. This work mainly discusses YOLO, R-CNN, and transformers, as well as the feature pyramid network (FPN), and analyzes their applications in metal surface defect detection according to their respective characteristics, to provide guidance for future research. YOLO has advantages in real-time industrial online detection, while R-CNN and transformer models show potential advantages in handling complex defect cases. Additionally, this work summarizes commonly used datasets and evaluation metrics for metal surface defect detection and analyzes the benchmark performance of different types of detection methods. It also discusses future research directions, including the current status and improvement paths of different models in terms of accuracy, real-time performance, and adaptability. Future models should focus on balancing accuracy and real-time performance, exploring new hybrid architectures, and improving adaptability to different metal surface defects to support further development in this field.

Keywords: metal surface defect detection; object detection; deep learning; YOLO; R-CNN; transformer; feature pyramid network; additive manufacturing; 3D printing

1. Introduction

Metal surface defect detection has long been critical to ensuring high-quality metal product manufacturing and 3D printing [1,2,3,4,5], where surface defects are among the most important quality issues [6,7,8,9]. Common metal surface defects include cracks, scratches, patches, and corrosion, which are frequently observed on the surfaces of metal objects [10]. These defects often affect the yield of metal product manufacturing, among other outcomes [11,12,13]. Defect detection is currently used in many industrial scenarios. For example, surface defect detection can be used to check for structural defects, such as dents and corrosion on racks and pallets in warehouse management [14], and for surface cracks and roughness in 3D metal printing [15]. In the latter case, it can be used to assess the surface-forming conditions of printed metal items, enabling the timely monitoring of product status.

In the field of surface defect detection, Zorić et al. [16] extracted the intensity of each frequency component by computing the Fourier spectrum of defect images, thereby extracting features for classification. Ni et al. [17] proposed a method based on edge partition features, utilizing adaptive partitioning and edge feature extraction to improve detection accuracy. Gyimah et al. [18] employed complete local binary patterns to discriminate among information from defects and perform classification. These traditional classification methods have achieved certain success in industrial detection.

However, metal surface defect detection is a challenging task. Defects on metal surfaces often show significant scale variation. They range from micro-cracks to large defect areas. This wide-scale range makes it difficult for a model to capture the fine details of small targets while preserving the overall structure of large targets [19,20]. In addition, defects usually have irregular shapes. This requires the detection model to have stronger multi-scale feature fusion capability [21,22]. In industrial environments, the contrast between defects and the background texture is sometimes low, which can lead to blurred boundaries [23,24]. When dealing with complex defects, traditional metal surface defect detection methods lack the ability to learn from new data. Their generalization ability is also limited [25,26,27].

Deep learning-based methods, such as You Only Look Once (YOLO) [28,29,30], R-CNN [31,32,33], and transformers [34,35,36], have achieved remarkable results in the field of metal surface defect detection. They automatically learn features from data, reducing manual effort, and exhibit strong adaptability across various defect detection scenarios, achieving excellent recognition rates [37,38,39]. In addition, multi-scale feature fusion mechanisms, such as pyramid-based feature fusion methods, play an important role in improving object detection performance.

Figure 1 shows the development of metal surface defect detection in recent years. It includes four stages: traditional machine learning-based, Convolutional Neural Network (CNN)-based, Pyramid Network (PN)-based feature fusion, and transformer-based ones. It also highlights representative methods and summarizes the key development trends.

Although the above deep learning-based methods have achieved strong performance in practice, directly applying general detection models to metal surface defect scenarios leads to several challenges. For example, small defects are easily missed, and low-contrast defects are difficult to identify. Therefore, it is necessary to compare and analyze the adaptability of different model structures in industrial defect scenarios. It is also important to systematically review recent improvements to general models for metal surface defect detection.

Several studies have summarized research in this field, such as surveys on steel surface defect detection and reviews of CNN-based defect detection methods [40]. They provide useful references for understanding the development of metal defect detection techniques. However, the existing reviews involve certain limitations. On the one hand, most of them focus on a single material scenario. They lack a comprehensive perspective across different types of metals and do not provide a sufficient comparison for the latter. On the other hand, some of them mainly concentrate on traditional CNN architectures or earlier object detection models. They do not analyze the structural evolution of recently developed detection frameworks. At the data level, the scale, categories, and application differences of public metal surface defect datasets directly affect model evaluation and algorithm design. However, systematic summaries of these datasets are lacking. Therefore, it is necessary to develop a review focused on metal surface defects and to provide a systematic analysis of deep learning methods for metal surface defect detection.

In response to the above issues, this paper focuses on key challenges in industrial scenarios and comprehensively reviews the research progress of deep learning in metal surface defect detection. Compared with existing surveys [41], the main features and contributions of this review paper are as follows:

1. It systematically introduces and compares CNN-based detection methods at different stages and transformer-related detection methods in the field of metal surface defect detection.
2. It establishes a unified cross-architecture comparison perspective. It clarifies the differences among detection paradigms in terms of accuracy, real-time performance, and application scenarios. It also compares the performance of different paradigms on the NEU-DET dataset.
3. It systematically summarizes the detailed information and detection applicability of commonly used public datasets in metal surface defect detection. It introduces evaluation metrics and analyzes the differences between academic and industrial data from the perspective of data distribution.
4. It builds a structured challenge analysis framework for industrial scenarios. It reveals the core bottlenecks in metal surface defect detection, including multi-scale modeling, efficiency trade-offs, and cross-domain generalization. It also proposes hybrid architectures that combine lightweight design and emerging models, such as Mamba, as important directions for future technological development.

This review systematically summarizes the current research on deep learning methods for metal surface defect detection. Section 1 outlines fundamental studies in surface defect detection. Section 2 presents deep learning approaches, including single-stage and two-stage detection algorithms based on CNN, as well as transformer-based detection methods. Section 3 describes PN architectures for image processing. Section 4 introduces several commonly used datasets for metal surface defect detection. Section 5 discusses evaluation metrics specific to this field. Section 6 presents a comparative analysis of the performance of different detection models when applied to metal surface defect detection. Section 7 explores potential future research directions in metal surface defect detection. Section 8 concludes this review paper.

2. Deep Learning Algorithms

2.1. Single-Stage Detection: YOLO

The YOLO series has consistently been a popular network model in single-stage object detection, characterized by its strong real-time performance, fast detection speed, simple model architecture, and ease of deployment. It has been widely applied in numerous object detection tasks [42,43,44]. The classic YOLO series consists of a four-part framework: the input layer preprocesses images; the backbone performs general feature extraction; the neck processes the features to enhance their diversity and robustness; and the output layer produces the object detection results. Figure 2 illustrates a detailed schematic diagram of the four modules in a classic YOLO network:

1. The Convolutional Block Layer (CBL) is composed of a convolutional layer and a batch normalization (BN) layer.
2. ResUnit is a residual combination of the CBL module.
3. Cross Stage Partial 1_X (CSP1_X) is composed of the CBL modules, ResUnit modules, and a convolutional layer.
4. Cross Stage Partial 2_X (CSP2_X) is formed by connecting a convolutional layer with multiple ResUnit modules.

With the continuous development of YOLO, many more versions are now available, e.g., the highly popular YOLOv8 series [45]. Compared to previous versions, YOLOv8 further improves accuracy, featuring a richer gradient flow that enhances its feature learning capability. Additionally, it decouples the classification and detection heads in the architecture, providing greater flexibility for modularization. However, for models of the same size variant, YOLOv8 has more parameters than YOLOv5. Therefore, when deploying a model to resource-constrained devices, YOLOv5 is more suitable [46].

YOLOv12 was released in February 2025. Compared with other YOLO-series networks, it incorporates the new FlashAttention mechanism. This addition enhances the model’s receptive field. Meanwhile, it further improves the inference speed with a minimal impact on model performance [47]. YOLOv13 proposes the Full-Pipeline Aggregation-and-Distribution Paradigm, which effectively alleviates information flow bottlenecks by establishing key information pathways across the backbone, neck, and head. Compared with YOLOv12, YOLOv13 enhances global context modeling through hyper-node-based representation, thereby improving recall in densely packed scenarios [48].

YOLO is a single-stage detector, notable for its strong real-time performance compared to multi-stage detectors, and it requires fewer computational resources than other models, e.g., R-CNN or transformer-based models. This makes it suitable for production tasks with limited resources and high real-time requirements. Many studies have further adapted YOLO and its major versions to metal surface defect detection, and we summarize them here. Zhao et al. [49] introduced a dual feature pyramid in the neck network of YOLOv5 and replaced the convolutional layers in the backbone with the Res2Net network to enlarge the receptive field. To handle defects of varying sizes, Huang et al. [50] designed a deformable convolutional feature extraction module and incorporated a context enhancement mechanism into the YOLOv5 network to aggregate information within the network. Similarly, Fan et al. [51] integrated a context integration module into the YOLO network and employed a genetic algorithm to cluster and fine-tune the anchors fed into YOLO, thereby improving the correlation between the dataset and the algorithm. Xie et al. [52] used the K-means++ algorithm to optimize the quality of the initial anchors and introduced separable convolutions to extract features at different scales. Gao et al. [53] used a PN architecture to effectively preserve shallow feature information when designing the YOLO network. Additionally, they introduced the Normalized Wasserstein Distance and Complete Intersection over Union as loss terms to improve the recognition ability for small targets.

These YOLO variants were optimized in terms of multi-scale feature extraction and anchor adjustment. Zhao et al. [49], Huang et al. [50], and Gao et al. [53] improved the multi-scale defect extraction capabilities of the defect detection network, enhancing the model’s ability to detect defects of different sizes. Fan et al. [51] and Xie et al. [52] optimized the network’s anchor selection, enabling the anchors to better adapt to the dataset’s target distribution. These methods have all improved the performance of the YOLO model in the field of metal surface defect detection and effectively enhanced the model’s applicability to complex defect scenarios.
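The anchor optimization discussed above can be illustrated with a small sketch. The following is not the implementation of any cited work; it is a minimal pure-Python example, assuming that anchors are clustered from labelled box sizes (width, height) with 1 − IoU as the distance and a K-means++-style seeding. The function names `wh_iou` and `kmeans_anchors`, the seed, and the toy box sizes are all illustrative assumptions.

```python
import random


def wh_iou(a, b):
    """IoU of two boxes given as (width, height), both placed at the origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors using 1 - IoU as the distance.

    Assumes `boxes` contains at least k distinct sizes; otherwise the
    seeding weights can all be zero and random.choices raises ValueError.
    """
    rng = random.Random(seed)
    # K-means++-style seeding: later centers are drawn with probability
    # proportional to their distance from the closest existing center.
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        dists = [min(1.0 - wh_iou(b, c) for c in centers) for b in boxes]
        centers.append(rng.choices(boxes, weights=dists, k=1)[0])
    for _ in range(iters):
        # Assignment step: each box joins the center it overlaps most.
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, centers[i]))
            groups[best].append(b)
        # Update step: each center becomes the mean size of its group.
        new_centers = []
        for i, g in enumerate(groups):
            if g:
                mean_w = sum(w for w, _ in g) / len(g)
                mean_h = sum(h for _, h in g) / len(g)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy box sizes: small blobs, long thin scratches, large patches.
boxes = [(10, 10), (12, 9), (9, 12), (100, 8), (96, 10),
         (110, 9), (60, 60), (64, 58), (58, 66)]
anchors = kmeans_anchors(boxes, k=3)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when a dataset mixes tiny defects with large defect areas.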

Table 1 presents the information on the improved YOLO models. It can be observed that most YOLO-based improvements focus on two main directions: multi-scale feature extraction and anchor optimization. In YOLO models, detection performance can be improved by introducing an FPN and refining anchor box boundaries. These methods improve the detection of small targets and defects with scale variation. A clear performance trade-off can also be identified in these improvements. Multi-scale fusion and node enhancement can increase detection accuracy. However, they also lead to larger model size and higher computational complexity.

Overall, research on YOLO-based models for metal surface defect detection mainly aims to improve small target detection and multi-scale capability. At the same time, these studies seek to balance real-time performance and accuracy to meet the practical requirements of industrial online inspection.

2.2. Two-Stage Detection: R-CNN

Single-stage object detection algorithms simplify the process to a single step, directly predicting the location and class of objects from an input image. In contrast, two-stage algorithms first generate candidate regions and then perform classification and regression by focusing on those regions.

As shown in Figure 3, Faster R-CNN is a classic two-stage algorithm [54]. It is composed of four main modules: the Conv layers, the Region Proposal Network (RPN), Region of Interest (RoI) Pooling, and Classification.

The Conv layers extract feature maps from the input image. They are composed of common CNNs. The RPN generates multiple anchors and produces candidate regions, enabling the model to produce high-quality proposals while reducing computational costs. Additionally, this approach enhances the model’s scalability, making it easier to integrate with other networks. After the RPN, the network uses RoI Pooling to obtain feature maps of a fixed output size, followed by the Classification module to complete the classification task.

Another common two-stage algorithm, Mask R-CNN, contains an RoI Align layer that uses bilinear interpolation to align each RoI with the feature map without coordinate quantization, better preserving the spatial information of the RoI. However, it requires high-quality dense segmentation annotations, necessitating a large amount of data and a great number of parameters.
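To make the role of bilinear interpolation concrete, the sketch below samples a single-channel feature map at fractional coordinates and averages a few samples per output bin. It is a simplified illustration rather than the Mask R-CNN implementation: the coordinate convention, the `samples` parameter, and the function names `bilinear_sample` and `roi_align` are assumptions made for this example.

```python
def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbouring cells always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool an RoI (y1, x1, y2, x2), in feature-map coordinates, to out_size x out_size.

    Each output bin is the mean of samples x samples bilinear samples,
    so the RoI boundaries are never rounded to integer cells.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    acc += bilinear_sample(feature, yy, xx)
            row.append(acc / (samples * samples))
        out.append(row)
    return out


# A 4 x 4 map whose value equals the column index.
feature = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature, roi=(0.5, 0.5, 2.5, 2.5))
```

RoI Pooling, by contrast, would first round the RoI and bin boundaries to whole cells, which is the source of the positional deviation that RoI Align avoids.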

The R-CNN series typically maintains excellent recognition rates in surface defect detection. Furthermore, due to its more precise handling of bounding boxes, the R-CNN series also excels in recognizing small objects. In the field of metal surface defect detection, Shen et al. replaced the backbone of the Faster R-CNN network with a residual network to address the issue of gradient explosion and used the K-means++ algorithm to redefine and select the optimal anchor box sizes during detection [55]. Ye et al. used the RoI Align algorithm in Faster R-CNN to eliminate positional deviations in candidate boxes and employed Non-Maximum Suppression (NMS) to improve the model’s handling of overlapping bounding boxes [56]. Wang et al. incorporated a dual-feature PN into Mask R-CNN to increase the amount of semantic information the model can extract. Additionally, they adopted a new evaluation metric, the Complete Intersection over Union, to improve the network’s judgment of candidate boxes [57]. Song et al. introduced deformable convolution into Faster R-CNN to enable the model to dynamically identify defects and designed a Suppression Block module to suppress the background of defect features, thereby making the defect features more distinct [58]. Similarly, Ye et al. integrated deformable convolutions into Faster R-CNN and introduced a balanced pyramid for path aggregation. For label assignment, this model achieved dynamic candidate box allocation through threshold control [59].

Among the above-mentioned improvements to R-CNN models, Shen et al. [55] and Ye et al. [56] optimized the anchor boxes of the R-CNN series to enhance the model’s adaptability to complex backgrounds. Song et al. [58] and Ye et al. [59] (who proposed a Robust Faster R-CNN) improved the network’s feature processing, for example by incorporating deformable convolutions to enhance the model’s recognition capability for defects of varying sizes. Additionally, Wang et al. [57] improved the model’s evaluation criterion to optimize candidate-region localization. Collectively, these modifications further enhanced the R-CNN series’ performance in identifying metal surface defects.
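The Complete Intersection over Union mentioned above extends plain IoU with a center-distance penalty and an aspect-ratio penalty. The sketch below follows the commonly used formulation; it is an illustration under the assumption that boxes are given as (x1, y1, x2, y2) with positive width and height, and the function name `ciou` is hypothetical. It is not taken from the cited work.

```python
import math


def ciou(box_a, box_b):
    """Complete IoU of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    iou = inter / (wa * ha + wb * hb - inter)

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0

    return iou - rho2 / c2 - alpha * v


# As a loss, 1 - CIoU is minimized; it stays informative for disjoint boxes.
loss = 1.0 - ciou((0, 0, 2, 2), (4, 0, 6, 2))
```

Unlike plain IoU, the value still changes when two boxes do not overlap, because the center-distance term keeps shrinking as the predicted box moves toward the target.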

Table 2 presents the information of the improved R-CNN models. It can be observed that most R-CNN-based improvements focus on region proposal optimization and adaptation to complex backgrounds. Studies enhance multi-scale anchor matching, refine anchor selection strategies, and introduce feature fusion mechanisms to improve localization accuracy under complex backgrounds and diverse defect shapes. These improvements strengthen the advantage of the R-CNN series in fine-grained region modeling.

However, compared with the YOLO series, R-CNN-based models generally incur higher computational costs and run at slower detection speeds. Therefore, reducing computational costs while maintaining the advantage of precise region modeling has become a key direction for the further development of the R-CNN series.

2.3. Transformer-Based Recognition

The transformer network model is now widely applied in the field of natural language processing [60,61,62]. Unlike CNNs, transformer models utilize attention mechanisms to discover relationships between various features, allowing them to handle relationships between any two positions in a sequence without distance constraints. At present, the transformer series of networks has also been applied in computer vision [63,64,65]. Vision Transformer (ViT) is a variant of the original transformer for computer vision (CV). It converts image inputs and employs the Encoder module to process image sequences and extract features [66]. Transformer-based models require significantly higher computational resources and are more complex than CNN-based networks for object detection. They are well suited to large datasets but tend to perform moderately on smaller datasets [67,68].

As shown in Figure 4, ViT consists of three main modules: Linear Projection of Flattened Patches (i.e., the Embedding layer), Transformer Encoder, and Multilayer Perceptron (MLP) Head. For image data, the data format is (H, W, C), where H, W, and C represent Height, Width, and Channels, respectively. Yet, the input for traditional transformer models needs to be a sequence of vectors. In the ViT model, the embedding layer first processes the image data by dividing the image into patches of 16 × 16 size. These patches are then flattened and mapped to one-dimensional vectors by a linear projection (in practice often implemented as a convolution whose kernel size and stride equal the patch size) to facilitate the input of image data. A Class Token is added to focus on the relationships between individual patches and the overall image. Additionally, to better understand the context of each patch, corresponding positional encodings are included, allowing the model to discern the position of each image block within the original image and thereby aiding in the comprehension of the image’s structure.
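The patch-splitting step described above can be sketched in a few lines. This is a minimal illustration that assumes the image is a nested list indexed as image[y][x][channel]; the function name `patchify` is hypothetical, and the learned projection, class token, and positional encodings are deliberately left out.

```python
def patchify(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):          # patches are read row by row
        for left in range(0, w, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append all C channel values
            patches.append(vec)
    return patches


# A 32 x 48 RGB image yields (32 / 16) * (48 / 16) = 6 patches,
# each flattened to 16 * 16 * 3 = 768 values.
image = [[[0.0, 0.0, 0.0] for _ in range(48)] for _ in range(32)]
tokens = patchify(image)
```

With the class token prepended, the encoder would therefore receive a sequence of 7 vectors for this example image.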

The original transformer model includes both an encoder and a decoder. In the ViT model, only the Encoder is utilized, and its structure is shown in Figure 5.

There are several steps included in the encoder:

1. Performing the Layer Normalization operation on the data sequence output from the Embedding layer to prevent gradient vanishing;
2. Mapping the data into Query (Q), Key (K), and Value (V) vector spaces, completing multi-head self-attention analysis through similarity calculation and weighted summation, and adding the result to the original input through a residual connection;
3. Sending the processed data into the norm layer and using the MLP to perform a non-linear transformation on the output of the self-attention mechanism to enhance the model’s expressive ability.
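The similarity calculation and weighted summation in step 2 correspond to scaled dot-product attention. The sketch below shows a single head in plain Python; it is an illustration only and assumes that Q, K, and V have already been produced by learned projections. Layer Normalization, the multi-head split, the residual connection, and the MLP are omitted, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(k[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention([[0.0, 0.0]], keys, values)   # query matches no key in particular
focused = attention([[10.0, 0.0]], keys, values)  # query strongly matches the first key
```

Because every query is compared with every key, a patch can attend to any other patch regardless of distance, at a cost that grows quadratically with the number of patches.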

The encoder output is fed into the MLP head to map the final ViT output to the target task’s output space and complete classification.

In addition to ViT networks, there is another transformer model designed to mimic convolutional networks: Swin Transformer (SwinT) [69]. SwinT computes attention within local windows and introduces a shifted-window approach to enable cross-window information exchange, which also allows it to handle images of arbitrary input sizes. However, this design can diminish its ability to understand global information and increases SwinT’s structural complexity.
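The cross-window exchange described above can be visualized by labelling each cell of a small grid with the window it belongs to, before and after the grid is cyclically shifted. The sketch below is only an illustration: the function name `window_ids` and the grid sizes are assumptions, and the attention masking that Swin Transformer applies to cells wrapped around the border is omitted.

```python
def window_ids(h, w, m, shift=0):
    """Label every cell of an h x w grid with the index of its m x m window.

    With shift > 0 the grid is cyclically shifted before partitioning,
    so windows straddle the borders of the unshifted windows.
    """
    cols = w // m
    grid = []
    for y in range(h):
        row = []
        for x in range(w):
            ys, xs = (y - shift) % h, (x - shift) % w
            row.append((ys // m) * cols + xs // m)
        grid.append(row)
    return grid


regular = window_ids(4, 4, m=2)           # four separate 2 x 2 windows
shifted = window_ids(4, 4, m=2, shift=1)  # shift by half a window
```

In the regular partition the four central cells fall into four different windows, whereas after the shift they share one window, so alternating the two partitions lets information cross window borders.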

The Detection Transformer (DETR) network is an end-to-end transformer architecture specifically designed for object detection [70]. It has been widely adopted in this domain, and its specific structure is shown in Figure 6.

DETR has five modules:

1. Backbone: Serving as the network’s backbone, it initially extracts image features, producing rich feature maps.
2. Encoder: Building on the features extracted by the CNN, it learns global information. It adds positional encodings to retain spatial information, which enables better interaction and information transmission between features at different positions.
3. Decoder: Its input consists of two parts, the encoder output and object queries. The latter are a fixed number of learnable query vectors, typically set to 100, with each vector representing a potential target to be detected. During decoding, these query vectors interact with the encoder features to directly generate bounding boxes.
4. Prediction heads: Composed of multiple feedforward neural networks (FFNs), the prediction heads output the final class probabilities and bounding box coordinates.
5. Bipartite matching loss: The predicted boxes are compared with the ground-truth boxes, and a bipartite matching algorithm is used to pair them one-to-one when computing the loss function, facilitating the continued training of the model.
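The one-to-one pairing in the last step can be illustrated on a tiny cost matrix. The sketch below finds the cheapest assignment by brute force over permutations, which is only feasible for a handful of boxes; in practice this problem is usually solved with the Hungarian algorithm (for example, SciPy provides `scipy.optimize.linear_sum_assignment`). The function name `match` and the cost values are illustrative assumptions.

```python
from itertools import permutations


def match(cost):
    """Find the cheapest one-to-one assignment of predictions to ground truths.

    cost[i][j] is the cost of pairing prediction i with ground truth j.
    Requires at least as many predictions as ground truths.
    Returns ({prediction index: ground-truth index}, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best = float("inf"), None
    # perm[g] is the prediction assigned to ground truth g.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return {p: g for g, p in enumerate(best)}, best_total


# Three predictions (rows) against two ground-truth defects (columns).
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
pairs, total = match(cost)
```

Predictions left without a partner (here the third row) are trained toward the "no object" class, which is why DETR does not need a separate duplicate-removal step.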

Compared with the YOLO series, which is also popular in object detection, DETR handles bounding box prediction in a simpler way. In the YOLO series, the input image is first divided into fixed-size grids, and multiple candidate boxes are generated within each grid to accommodate objects of different sizes and aspect ratios. In contrast, DETR utilizes a transformer architecture to directly output the coordinates of bounding boxes, resulting in a more streamlined overall process that does not require hand-set priors such as predefined candidate boxes.

Compared to YOLO and R-CNN series, transformer-based networks can analyze contextual information in images, enabling the generation of richer feature representations and often achieving higher accuracy. Many researchers have further optimized transformer-based models for metal surface defect detection, as summarized in Table 3: Wei et al. proposed a detection model based on ViT to address the issues of varying defect sizes and class imbalance in metal defect detection. The model employs convolution with receptive field attention and a context broadcasting median module that optimizes the distribution density of attention maps through median pooling, thereby improving defect detection accuracy under complex backgrounds [71]. Jeong et al. combined the ResNet-50 network with the ViT network, utilizing ResNet-50 for feature extraction at both high and low levels of the network while employing ViT to organize contextual information within the model [72].

In the field of SwinT networks, Zhu et al. proposed a detection model, LSwin Transformer. It combines dilated and standard convolutions to enlarge the receptive field and adopts a novel cross-window strategy for computing attention [73]. Liu et al. integrated a dual-branch network with SwinT, utilizing spatial and channel attention mechanisms, as well as bilinear interpolation to enhance edge detection. This approach improves the extraction of global semantic information for steel strip surface defect detection [74]. Additionally, Huang et al. incorporated SwinT into the neck module of YOLO to enhance the model’s capability for understanding deep semantic features. They also embedded the SE module into YOLO’s CSP module to strengthen channel information and further refined the FPN structure to better fuse multi-scale information [75].

In the field of DETR, Xia et al. designed a foreground supervision module in DETR to progressively extract foreground features based on feature scores. They proposed a cascaded hybrid matching strategy to increase the number of positive samples and avoid Non-Maximum Suppression post-processing [76]. Mao et al. combined the lightweight MobileNetV3 with a multi-path feature fusion module based on the Real-Time Detection Transformer, significantly reducing the model’s parameter size and computational complexity. Moreover, they proposed an improved MPDIoU loss function to improve bounding box regression accuracy [77]. In the aforementioned improvements based on transformer models, Wei et al. [71] improved the model’s attention mechanism, thereby enhancing its ability to understand features; Jeong et al. [72] introduced a residual network and modified the window shifting strategy to strengthen the model’s feature extraction capability; Xia et al. [76] and Mao et al. [77] optimized the network’s post-processing. These transformer-based methods have achieved considerable success on metal surface defect detection datasets.
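For context, the Non-Maximum Suppression post-processing that such matching strategies aim to avoid is short enough to sketch in its basic greedy form. The example below is a generic illustration, not code from any cited work; the box format (x1, y1, x2, y2), the threshold of 0.5, and the function names are assumptions.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

The threshold is a hand-tuned parameter, and closely spaced defects can be suppressed by mistake, which is one motivation for detectors that produce a duplicate-free set of boxes directly.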

Table 3 presents the information on the improved transformer-based models. It can be observed that most improvements focus on optimizing the attention mechanism and improving information extraction capability. Related studies enhance attention modules, refine window partition strategies, and improve feature extraction mechanisms to strengthen the model’s ability to capture complex defects. These efforts highlight the potential value of transformer models in attention-based modeling.

However, transformer-based models face such problems as high computational costs and slow detection speeds. Although some lightweight designs have achieved improved inference efficiency, there is some room to improve accuracy [77]. Overall, future research on transformer-based models should balance modeling capacity and computational efficiency to meet the requirements of industrial deployment.

3. Image Processing Techniques

In the field of metal surface defect detection, different detection targets in an image have different physical sizes. To help the network better handle detection targets of different sizes, Pyramid Networks (PNs) have been widely used in models for image feature processing [78,79,80,81,82].

Traditional image pyramids usually process images independently at each scale. The extracted features are relatively basic, and repeatedly computing features for images at different scales may lead to high computational overhead [83]. The Feature Pyramid Network (FPN) aims to effectively utilize multi-scale features by combining high-level and low-level features through lateral connections to form a richer feature representation. Its structure is shown in Figure 7.

FPN consists of two parts: (a) a bottom-up pathway from shallow layers to deep layers; and (b) a top-down pathway from deep layers to shallow layers. These two parts are linked by lateral connections. In FPN, feature extraction is first performed by the backbone network, and after multiple convolutional layers, the feature map size is gradually reduced, while semantic information is enriched. These features pass through a 1 × 1 convolution that adjusts the number of channels, followed by feature fusion in the top-down pathway. In the top-down pathway, the features of the uppermost layer are up-sampled (e.g., by interpolation) so that the high-level feature map matches the size of the lower-level feature map, and the two are then summed or concatenated to form a new feature map, which contains the semantic information of the higher layers while retaining the details of the lower layers.
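The top-down fusion described above can be sketched on single-channel maps. This is a minimal illustration under several assumptions: each level is exactly half the size of the one below it, nearest-neighbour up-sampling is used, fusion is element-wise addition, and the 1 × 1 lateral convolutions are taken as already applied. The function names `upsample2x` and `top_down_fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2D map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def top_down_fuse(laterals):
    """Fuse lateral maps ordered from shallow (largest) to deep (smallest).

    Starting from the deepest map, each fused map is up-sampled and added
    to the next shallower lateral map.
    """
    fused = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        up = upsample2x(fused[0])
        merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        fused.insert(0, merged)
    return fused


# Three levels: a 4 x 4 detail map, a 2 x 2 map, and a 1 x 1 semantic map.
c2 = [[1.0] * 4 for _ in range(4)]
c3 = [[10.0] * 2 for _ in range(2)]
c4 = [[100.0]]
p2, p3, p4 = top_down_fuse([c2, c3, c4])
```

After fusion, the shallowest output carries contributions from every deeper level, which is how high-level semantics reach the high-resolution map used for small defects.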

FPN can be further integrated with other popular network architectures, such as YOLO or transformers, and this has already become a current trend. FPN can also be optimized for specific scenarios, and many scholars have further improved it for metal surface defect detection. Some improved FPN-based models are summarized in Table 4. Han et al. proposed a pyramid structure for small object detection, SA-FPN. By introducing a multi-scale fusion module with perceptual ability, they guided the model to solve the problem of information loss for small surface defects during the model filtering process [84]. To enhance the model’s ability to handle metal surface defects of different sizes, Du et al. designed an adaptive focusing PN (Foc-FPN) [85]. They input multi-level features into the model and made it focus on the spatial information connections between defects. Zhou et al. proposed CM-FPN, which is based on the attention mechanism [86]. In this structure, both global attention and channel attention are considered simultaneously, enabling the model to focus on the information-rich parts of different layers. With the model’s speed in mind, Zhu et al. introduced a feature fusion block into the pyramid structure to extract information from different layers and enable interactions between them [87]. All of the above improvements to the pyramid model emphasize the connections between different layers. This is also an advantage of the pyramid structure, which can better integrate the model’s information and thus achieve information interaction.

Table 4 presents the information on the improved FPN modules. It can be observed that most improvements based on FPN focus on multi-scale feature fusion. The related studies listed in Table 4 optimize the pyramid information structure, enhance feature attention, and integrate multi-scale feature representation mechanisms to improve defect recognition performance.

This trend indicates that multi-scale feature representation based on FPN has clear advantages in metal surface defect detection. It is particularly effective in scenarios with significant scale variation. However, its use tends to increase model complexity. Therefore, the resulting change in inference efficiency should also be carefully considered.

4. Metal Surface Defect Dataset

In the field of metal surface defect detection, several publicly available datasets can be used for research and applications.

The NEU-DET Surface Defect Database, created by Northeastern University [88], contains surface defect scenarios of steel strips, covering six different types of defects: rolled-in scale, patches, crazing, pitted surface, inclusion, and scratches. Each defect type has 300 samples, totaling 1800 images in this dataset. Sample surface defects from this dataset are shown in Figure 8. However, the intra-class defects exhibit significant variation, while the inter-class defects are overly similar, posing challenges for their accurate classification.

The GC10-DET Dataset focuses on surface defects of steel plates [89], featuring ten main types of defects, including punch holes, weld seams, and crescent-shaped gaps, all collected from real-world scenarios. The schematic diagram of the surface defects is shown in Figure 9. After reorganization, the dataset contains 2294 grayscale images. It was compiled by Lv et al. from Tianjin University.

The Tianchi Aluminum Surface Defect Dataset is an aluminum surface defect dataset released during the Tianchi Data Competition [90]. It is also sourced from real industrial scenarios, utilizing images from leading companies in the aluminum profile sector. The competition provided a total of ten thousand images containing defects such as paint blisters, dirt spots, and scratches, providing important references for aluminum processing and quality control. Some samples of the surface defects are shown in Figure 10. The dataset exhibits an imbalanced data distribution, and there is considerable variation in defect specifications across categories, which adds difficulty to the classification task.

The X-SDD dataset is a surface defect dataset for hot-rolled steel strips, which contains a total of 1360 defective images [91]. In the X-SDD dataset, the resolution of each defective image is 128 × 128 pixels, and it includes seven different types of surface defects, such as red scale, scale ash, surface scratches, etc. Some samples of the surface defects are shown in Figure 11. Compared with the classic NEU-DET defect dataset, X-SDD provides more types of defects and is suitable for application in the field of industrial defect detection.

The RSDDs dataset covers surface defects of railway tracks. It is annotated by professional inspectors, which provides a reliable foundation for detection [92]. The dataset is divided into Type I and Type II subsets, corresponding to two different sources of railway track data: the Type I subset is collected from express tracks and contains 67 images; the Type II subset is collected from ordinary or heavy-duty transportation tracks and contains 128 images. Examples of the surface defects are shown in Figure 12.

Table 5 lists the detailed information on the above-mentioned datasets, including data source, defect type, image size, number of images, number of classes, annotation type, imbalance ratio, link, and related research. This information should help readers better understand these datasets and use them effectively.
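One way to quantify the imbalance of a dataset from its label list is sketched below. The definition used here (largest class count divided by smallest class count) is an assumption made for illustration and may differ from the definition behind the imbalance ratio reported in Table 5. The class names and the long-tail counts are hypothetical; only the balanced case (six classes with 300 samples each) follows the NEU-DET description above.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Balanced case: six classes with 300 samples each, as described for NEU-DET.
balanced = [f"class_{i}" for i in range(6) for _ in range(300)]

# Hypothetical long-tail case, typical of industrial data collection.
long_tail = ["scratch"] * 900 + ["dent"] * 90 + ["blister"] * 10

print(imbalance_ratio(balanced))   # 1.0
print(imbalance_ratio(long_tail))  # 90.0
```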

Table 6 presents the typical mAP ranges for the YOLO, R-CNN, and transformer series methods on three mainstream industrial defect detection datasets: NEU-DET, GC10-DET, and RSDDs. All performance values in the table represent approximate typical ranges reported for each type of method on the corresponding dataset. Due to differences in network backbones, input image sizes, and training strategies used in different studies, the results across different method categories and within the same category are not strictly comparable. They are provided only to reflect the overall performance level of each type of method.

The five metal surface defect datasets recommended in this paper are all based on industrial scenarios. Compared with academic datasets, industrial datasets often show a clear long-tail distribution. Some defect categories contain only a small number of samples, and multiple defect types may appear in a single image. In contrast, academic datasets usually have relatively balanced categories and sufficient samples. In terms of image acquisition, academic datasets are often collected under controlled lighting and shooting conditions, with limited interference. In real industrial environments, data collection is often affected by reflection, noise, and complex texture interference. As a result, industrial datasets present greater diversity in image content and quality.

At the annotation level, labeling in industrial datasets is often influenced by subjective judgment. Academic datasets are usually reviewed through standardized manual verification procedures. This makes defect detection on industrial datasets more complex and increases the difficulty of a detection task. Differences in statistical distributions and data-acquisition mechanisms create a gap between academic and industrial datasets.

For small defect detection, GC10-DET and the Tianchi Aluminum Surface Defect Dataset are recommended industrial datasets. Compared with other datasets, these two datasets contain high-resolution images collected from real industrial environments. The defect scales cover a broader range, and some defects occupy only a very small proportion of the entire image. In GC10-DET, defect categories include small targets such as a rolled pit. Compared with NEU-DET, GC10-DET is more challenging in terms of category diversity and variations in the defect scale. It can be used to effectively evaluate a model’s ability to detect tiny defects. The Tianchi Aluminum Surface Defect Dataset also contains many small defects, such as stains and dents. The background is complex and includes various interference factors. It is closer to real industrial inspection scenarios and can be used to effectively assess the practical performance of a model in complex real-world environments.

5. Evaluation Metrics

In the field of metal surface defect detection, many metrics are available to evaluate the performance of a model, and the appropriate ones depend on the specific detection scenario. Common choices include Precision [103], Recall [104], Average Precision [105], Mean Average Precision [106], and Frames Per Second [107].

Precision (P) is a key metric for assessing the performance of classification models. It assesses the model’s accuracy in predicting positive classes. Precision is defined as the ratio of the number of true positive samples correctly predicted as positive to the total number of samples predicted as positive, i.e.,

P = \frac{TP}{TP + FP} \quad (1)

where TP represents the number of true positive samples correctly predicted as positive, and FP represents the number of false positive samples incorrectly predicted as positive. Precision effectively reflects the accuracy of the model in predicting the correct samples. It is particularly important in applications where reducing false positives is critical. However, precision does not take into account the false negatives. Therefore, focusing solely on precision in imbalanced datasets may lead to a misjudgment of model performance.

Recall (R) is primarily used to measure the model’s ability to identify positive samples. It is defined as the ratio of the number of true positive samples correctly identified by the model to the total number of actual positive samples, which includes TP and false negatives, i.e.,

R = \frac{TP}{TP + FN} \quad (2)

where TP represents the number of true positive samples correctly identified as positive, and FN represents the number of false negative samples that are not identified as positive. The advantage of recall is that it emphasizes the model’s comprehensiveness in identifying positive samples, making it suitable for scenarios where the cost of missing detections is high. However, the drawback is that it may ignore the false positive cases, which leads to the need for combining precision and recall to comprehensively evaluate model performance in practical applications.
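Equations (1) and (2) can be illustrated with a small worked example. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for convenience, not something prescribed by the definitions.

```python
def precision(tp, fp):
    """Equation (1): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Equation (2): share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical inspection run: 80 defects found correctly,
# 20 false alarms, 40 defects missed.
tp, fp, fn = 80, 20, 40
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))
```

In this example the detector raises few false alarms (precision 0.8) but misses one third of the actual defects (recall of about 0.67), which shows why the two metrics need to be read together.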

Average precision (AP) is a fundamental and well-defined metric for assessing the accuracy of metal surface defect detection. It serves as an intuitive measure of a model’s ability to detect a specific category of objects, where higher AP values indicate better detection performance for that category. Specifically, AP is defined as the area under the precision–recall curve and is typically computed using standard numerical integration or approximation methods. The calculation process is as follows:

AP = \int_{0}^{1} p(r) \, dr \quad (3)

where p(r) denotes the precision at recall level r.
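A minimal sketch of one common numerical approximation of Equation (3) is given below. It assumes the all-point interpolation scheme (a monotone precision envelope summed over recall steps); other benchmarks use different sampling rules, so the resulting values are not interchangeable. The function name and the example detections are hypothetical, and the matching of detections to ground-truth boxes (for instance by an overlap threshold) is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve using all-point interpolation."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest confidence first
    hits = np.asarray(is_true_positive, dtype=bool)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    recall = tp / num_ground_truth
    precision = tp / (tp + fp)
    # Add sentinel points so the curve spans recall 0 to 1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision by the maximum precision at any higher recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas over the recall steps.
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Four detections of one class, four ground-truth defects in total.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(ap)  # 0.625
```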

The Mean Average Precision (mAP) is a commonly used evaluation metric in object detection tasks, designed to quantify the precision and recall of a model in detecting objects. mAP is calculated as the arithmetic mean of the per-class AP values. For a C-class detection task, it is defined as:

mAP = \frac{1}{C} \sum_{i=1}^{C} AP_{i} \quad (4)

where AP_{i} represents the average precision of the i-th class.

In object detection, mAP comprehensively reflects a model’s overall performance in object localization and recognition, providing a critical reference for model selection and optimization.
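Equation (4) reduces to a plain average over classes, as the short sketch below shows. The per-class AP values are hypothetical and are not results from any of the reviewed studies.

```python
def mean_average_precision(ap_per_class):
    """Equation (4): arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values for a three-class task.
ap_per_class = {"crack": 0.5, "scratch": 1.0, "dent": 0.75}
print(mean_average_precision(ap_per_class))  # 0.75
```

Because every class carries the same weight, a rare defect class with a low AP lowers mAP as much as a frequent one, which matters for long-tail industrial datasets.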

Frames Per Second (FPS) is a metric used to evaluate the real-time performance of a detection model. It represents the number of image frames that the model can process per unit time. A higher FPS indicates faster inference speed and stronger real-time capability. FPS is typically defined as the number of images processed within a given time period, i.e.,

FPS = \frac{N_{total}}{T_{total}} \quad (5)

where N_{total} denotes the total number of processed images, and T_{total} denotes the total time taken to complete the inference of these images.
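Equation (5) can be measured with a simple timing loop, sketched below. The function name, the warm-up step, and the injectable clock are assumptions added for illustration; the stand-in `infer` callable is a placeholder for a real detector. On accelerators that execute asynchronously, the device would also need to be synchronized before the clock is read, which this sketch does not cover.

```python
import time

def measure_fps(infer, images, warmup=2, clock=time.perf_counter):
    """Return N_total / T_total for running `infer` once on every image."""
    for image in images[:warmup]:   # untimed warm-up runs (assumed practice)
        infer(image)
    start = clock()
    for image in images:
        infer(image)
    elapsed = clock() - start
    return len(images) / elapsed

# Stand-in "model": sums the pixels of a flat list (hypothetical workload).
frames = [[1] * 10_000 for _ in range(50)]
print(measure_fps(lambda image: sum(image), frames))
```

Reported FPS values depend on hardware, batch size, and input resolution, so numbers from different studies are comparable only when these settings match.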

In Table 7, we present a statistical analysis of six metal surface defect detection networks on NEU-DET based on different backbone models. The reported metrics include mAP, number of parameters, and FPS. We also provide the input image size used during training, the dataset split, and whether data augmentation was applied. The evaluated models include the Improved Dilated Neighborhood Attention Transformer (DINAT), a Fast R-CNN model with Swin Transformer by Li et al. [108], and Faster R-CNN, YOLOv5s, and Improved YOLO models designed and validated by Yang et al. [109]. Due to variations in training settings across different studies, the reported results are intended to reflect overall performance ranges.

Among these models, Improved DINAT achieves the highest mAP of 83.7%, which reflects the strong recognition capability of transformer-based models. The Improved YOLO model achieves the highest FPS of 70.2, maintaining the high detection speed of the YOLO series. In terms of parameters, both YOLOv5s and the Improved YOLO model have about 7 million parameters. This is significantly fewer than the transformer-based and R-CNN-based models, which have more than 60 million parameters [108].

6. Comparative Critical Analysis of Detection Paradigms

6.1. Performance and Trade-Off Comparison of YOLO, R-CNN, and DETR

In the field of metal surface defect detection, the engineering application of detection models is often affected by their detection accuracy, real-time performance, deployment requirements, and other related factors. YOLO, R-CNN, and DETR in the transformer framework represent three widely used detection paradigms. Due to their different architectural design principles, they exhibit clear differences in detecting small-scale, irregular, and complex-textured background defects.

The existing studies mainly focus on model structures and performance reports. They seldom explore the intrinsic relationship between paradigm characteristics and defect properties. This section provides a structured analysis of the three paradigms from the perspectives of performance range, task adaptability, and other relevant aspects. It further examines the coupling relationship between these detection paradigms and metal surface defect detection. The aim is to clarify the applicability boundaries and optimization directions of different paradigms.

Table 8 presents a detailed performance comparison of different detection paradigms. From the perspective of mAP performance, the DETR series achieves the highest detection accuracy among the three paradigms. This advantage is attributed to its global attention mechanism, which helps capture weak features and distinguish defects under complex backgrounds. In terms of model size, the YOLO series models are relatively small. This property is beneficial for fast training and convenient deployment. In contrast, the R-CNN and DETR series adopt more complex architectures and contain more parameters. Regarding inference speed, the DETR series operates at a moderate FPS due to the computational cost of its attention mechanism. The R-CNN series is affected by its two-stage design and thus has lower FPS than YOLO models.

From the perspective of sensitivity to small defects, earlier YOLO models showed moderate performance on small object detection [110]. They relied on multi-scale feature fusion to improve the perception of small targets. In recent versions, the YOLO series has achieved clear improvements in this aspect. The R-CNN series enhances local feature representation through region cropping. This strategy is beneficial for small defect localization to some extent. However, its recall performance depends on the quality of region proposal generation. The Transformer-based series performs global modeling through self-attention. It demonstrates a strong representation capability under weak features and complex backgrounds, and it can detect small defects through a global attention mechanism.

In terms of deployment costs, the YOLO series adopts a relatively lightweight architecture and provides fast inference speed. It is more suitable for industrial online inspection and deployment on edge devices. The R-CNN series introduces additional computational overhead due to its two-stage design, which leads to a higher deployment cost. The transformer-based series has medium to high computational complexity and requires considerable computing power and memory resources.

In terms of core modeling strategies, YOLO combines end-to-end dense prediction with convolution-based local modeling. The R-CNN series is built on a two-stage framework, which includes region proposal generation and local feature extraction. The transformer-based series performs overall modeling through anchor-free prediction and a global attention mechanism. The structural differences among these three paradigms have largely determined their applicability boundaries and engineering feasibility in metal surface defect detection.

6.2. Coupling Analysis of Detection Paradigms and Defect Features

In metal surface defect detection tasks, defects are often small in scale and irregular in shape. The detection process also faces complex textured backgrounds. In addition, metal surface defects are usually rare. As a result, available defect samples are often limited in number. For small datasets, different detection paradigms exhibit distinct representation capabilities across defect types.

In metal surface defect detection, small defects are common. These small-scale defects occupy limited regions in images and contain weak feature information. During feature extraction and sampling, their representations are easily suppressed, which makes them more difficult to detect. In recent versions, the YOLO series introduces a feature pyramid structure to improve small-object detection. Through this structure, shallow high-resolution features participate in prediction, which enhances the perception of small targets. The Feature Pyramid Network (FPN) plays an important role in multi-scale feature fusion. Through top-down information flow and cross-layer feature integration, FPN enriches high-resolution feature maps with deep semantic information. This process improves the representation ability for small objects and alleviates the limitation of YOLO in detecting small defects. The R-CNN series applies a region proposal mechanism and performs region cropping on candidate areas. This strategy increases the proportion of small defects within local regions and strengthens localization performance. However, the region proposal process depends on feature map resolution and scale design. When the defect size approaches the down-sampling limit, the recall rate tends to decrease. The DETR series adopts an anchor-free prediction strategy and introduces multi-scale attention mechanisms. It dynamically aggregates features at different scales. This design eases the scale constraints caused by anchor boxes and shows certain advantages in recognizing small-scale defects.
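The down-sampling limit mentioned above can be made concrete with a short calculation of how many feature-map cells a defect spans at different strides. The strides 8, 16, and 32 and the 12-pixel defect size are assumptions chosen for illustration; actual values depend on the backbone and the dataset.

```python
def feature_cells(defect_px, stride):
    """Side length of a defect measured in feature-map cells at a given stride."""
    return defect_px / stride

# A hypothetical 12-pixel defect at three assumed pyramid strides.
for stride in (8, 16, 32):
    print(stride, feature_cells(12, stride))
# 8 1.5
# 16 0.75
# 32 0.375
```

At the coarsest assumed level the defect covers less than half a cell, so its response is easily averaged away, whereas the high-resolution level still resolves it. This is the motivation for letting shallow features take part in prediction.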

In metal surface defect detection tasks, defects such as cracks and dents often show significant shape variations and irregular boundaries. Both anchor-based YOLO and R-CNN series rely on rectangular bounding boxes. This constraint may lead to fitting bias for complex shapes. In scenarios with slender or curved cracks, localization errors may occur. Although FPN improves multi-scale feature representation, it does not remove the geometric limitation of anchor boxes. Therefore, the adaptability to highly irregular targets is still restricted by the structural design. The DETR series adopts an anchor-free prediction strategy and establishes global feature relationships through self-attention. This mechanism allows the model to capture object shape characteristics over a wider spatial range. It shows structural advantages when handling defects with large shape variations.
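The fitting bias of axis-aligned boxes for slender defects can be estimated with simple geometry. The sketch below models a crack as a thin rotated rectangle, which is a simplifying assumption, and computes the fraction of its axis-aligned bounding box that the crack actually covers. The dimensions are hypothetical.

```python
import math

def box_fill_ratio(length, width, angle_deg):
    """Area of a rotated thin rectangle divided by its axis-aligned box area."""
    t = math.radians(angle_deg)
    box_w = length * abs(math.cos(t)) + width * abs(math.sin(t))
    box_h = length * abs(math.sin(t)) + width * abs(math.cos(t))
    return (length * width) / (box_w * box_h)

# Hypothetical crack: 200 px long, 4 px wide.
print(box_fill_ratio(200, 4, 0))    # horizontal crack fills its box
print(box_fill_ratio(200, 4, 45))   # diagonal crack fills only a few percent
```

For the diagonal case, roughly 4% of the box belongs to the crack and the remainder is background, which illustrates why rectangular boxes describe slender or curved defects poorly.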

Industrial metal surfaces often contain obvious texture patterns and background noise. In some cases, the difference between defect features and the background is small. This situation may lead to missed detection and false positives. YOLO and R-CNN series rely on convolution to extract local features. Their receptive fields expand layer by layer. Given strong textured backgrounds, their ability to distinguish defects from the background is limited. FPN enhances feature representation by fusing semantic information from different levels. This design improves robustness to interference. However, it still follows a local convolution paradigm. The DETR series establishes cross-region feature modeling through a global attention mechanism. It strengthens defect-related features and suppresses background interference at a global scale. In complex texture or low-contrast scenarios, its feature representation is relatively stable.

When constructing datasets for metal surface defect detection, high data acquisition costs and annotation difficulty are common challenges. As a result, the available dataset size is often limited. The convolution-based R-CNN and YOLO models have a strong structural inductive bias. Through shared convolutional kernels, the models can learn stable feature representations. This property is beneficial under small-data conditions and can help the training process converge more reliably. In addition, FPN strengthens multi-scale features through cross-layer fusion. When information is limited, low-level features can provide useful spatial details. In contrast, the Transformer-based DETR series relies on self-attention for comprehensive modeling. The self-attention mechanism requires large-scale data to learn effective parameter representations. It establishes spatial and feature relationships mainly through data-driven learning, rather than structural constraints. When the dataset size is limited, this characteristic may lead to overfitting or slow convergence.

7. Future Work

The detection of metal surface defects is an indispensable aspect of the manufacturing industry. With the continuous development of technology, there are now many solutions for the detection of metal surface defects. However, when dealing with some relatively complex surface defects, there is still a great deal of room for improvement in the classification performance of existing models. Compared with purely technical evolution, the current field requires analysis that starts from the fundamental nature of the problem. It is necessary to conduct a systematic examination of the key challenges and to develop future research directions around these core issues.

Based on existing studies and industrial practice, metal surface defect detection faces several key challenges. These include (1) multi-scale variation and the difficulty of small object detection, (2) the trade-off between detection accuracy and real-time performance, (3) data scarcity and class imbalance, (4) model complexity and computational efficiency, (5) model interpretability, and (6) domain shift. These issues represent important research topics in the field of metal surface defect detection. A systematic discussion of these challenges is necessary to promote the continuous development of related technologies.

7.1. Multi-Scale Variation and Difficulty of Small Object Detection

Compared with other detection tasks, metal surface defects often show significant variations in size. This characteristic challenges the ability of models to detect small objects and handle multi-scale defects. The studies on YOLO in industrial scenarios focus on multi-scale feature fusion [111,112]. The results indicate that, when applied to datasets with large size differences and small objects, YOLO has room for improvement in multi-scale modeling. To address this issue, deformable convolution can be introduced to adaptively extract features at different scales, thereby enhancing a YOLO model’s adaptability to scale variation.

In addition, to handle significant size variation and small object detection, most studies adopt the FPN structure to further improve the three mainstream models, i.e., YOLO, R-CNN and transformer. FPN integrates multi-scale information at each level and fuses features of different resolutions, including small-scale features. This design enables the network to process defects of various sizes more effectively. Future research can further explore improvements and applications of the FPN structure in different detection models.

7.2. Trade-Off Between Detection Accuracy and Real-Time Performance

Since most metal surface defect detection tasks arise from industrial environments, the trade-off between detection accuracy and inference speed becomes a key issue. A YOLO model adopts a one-stage detection framework and formulates prediction as a regression problem. This design improves inference speed, but its accuracy is often lower than that of two-stage methods. Therefore, further research can focus on improving the accuracy of the YOLO series. For example, effective attention mechanisms can be introduced into it, and hybrid architectures that combine YOLO with other frameworks can be designed.

In contrast, the R-CNN series generally achieves high accuracy, especially in complex background conditions. However, the complex processing of region proposals increases model complexity and reduces inference speed. Its real-time performance is weaker than that of YOLO. In addition, R-CNN heavily depends on the quality of region proposals. Future research needs to focus on improving its inference speed while maintaining detection accuracy. Possible directions include refining the region proposal module or constructing hybrid structures that integrate one-stage and two-stage designs to achieve a better balance between accuracy and speed.

The transformer-based series performs global modeling through self-attention and shows strong accuracy potential for defect detection subject to complex backgrounds. However, its computational complexity is high, and the inference cost on high-resolution industrial images is considerable. This limitation restricts its real-time deployment. Therefore, reducing its computational cost and developing its lightweight version through, e.g., knowledge distillation [113,114,115], are key directions for balancing accuracy and real-time performance in transformer-based models.
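As an illustration of the knowledge distillation idea, the sketch below implements the classic softened-softmax formulation, in which a small student is trained to match the temperature-scaled class probabilities of a large teacher. This is one common formulation and is not necessarily the one used in the cited studies; it covers classification logits only, not box regression. The temperature of 4.0 and the example logits are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis with temperature scaling."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened probabilities, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(temperature ** 2 * kl.mean())

teacher_logits = np.array([[6.0, 1.0, 0.5]])   # hypothetical teacher output
student_logits = np.array([[2.0, 1.5, 1.0]])   # hypothetical student output
print(distillation_loss(student_logits, teacher_logits))
```

In practice this term is usually combined with the ordinary supervised detection loss, so that the lightweight model learns from both the labels and the teacher.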

7.3. Data Scarcity and Class Imbalance

In practical metal manufacturing environments, the number of defect samples is much smaller than that of normal samples. There is also a clear imbalance among different defect categories. This defect data scarcity limits the generalization ability of deep learning models and increases the risk of overfitting.

To address data scarcity and class imbalance in metal surface defect detection, future research can explore self-supervised learning and transfer learning strategies [116,117]. By pretraining on a large number of unlabeled industrial images, models can learn more general feature representations and reduce the dependence on manually annotated data. In addition, data augmentation methods based on generative models can synthesize diverse defect samples [118,119]. This approach helps alleviate class imbalance and improves the recognition ability for minority defect categories [120,121].

When defect samples are extremely limited, unsupervised anomaly detection methods are of great importance [122,123]. These methods do not require extensive defect annotations. They first learn the feature distribution of normal samples. Regions that significantly deviate from the learned normal pattern are then identified as potential defects. In this way, the reliance on labeled defect data can be effectively mitigated. Future research can further improve anomaly decision mechanisms, for example, by refining feature difference measurement or enhancing feature representation capability, in order to improve detection stability and practical performance.
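The two steps described above (learning the distribution of normal samples, then flagging large deviations) can be sketched with a Gaussian model and the Mahalanobis distance. This is one simple instance of the idea, chosen here for brevity, and is not claimed to be the method of the cited works. The feature vectors are synthetic stand-ins for features that a pretrained backbone would produce, and the 99th-percentile threshold and the regularization constant are assumptions.

```python
import numpy as np

def fit_normal_model(normal_features, eps=1e-3):
    """Estimate mean and regularized inverse covariance from defect-free samples."""
    mean = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False)
    cov = cov + eps * np.eye(normal_features.shape[1])  # keep the matrix invertible
    return mean, np.linalg.inv(cov)

def anomaly_scores(features, mean, cov_inv):
    """Mahalanobis distance of each feature vector to the normal distribution."""
    d = features - mean
    return np.sqrt(np.einsum("nd,de,ne->n", d, cov_inv, d))

rng = np.random.default_rng(0)
normal = rng.standard_normal((500, 8))          # synthetic defect-free features
mean, cov_inv = fit_normal_model(normal)

# Threshold chosen as the 99th percentile of scores on normal data (assumption).
threshold = np.percentile(anomaly_scores(normal, mean, cov_inv), 99)

suspect = np.full((1, 8), 6.0)                  # synthetic far-from-normal sample
is_defect = anomaly_scores(suspect, mean, cov_inv) > threshold
print(is_defect)
```

No defect labels are used at any point, which is the property that makes this family of methods attractive when defect samples are extremely limited.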

7.4. Model Complexity and Computational Efficiency

Among YOLO, R-CNN, and transformer-based models, the transformer architecture processes all input tokens through global attention. It considers the complete input information during computation, which results in relatively high model complexity. Therefore, improving the efficiency of transformer-based models has become an important research issue.

In recent years, some studies have attempted to combine CNN and transformer architectures [124,125]. CNN models are effective in extracting local features and require lower computational cost. By integrating CNN and transformer techniques, it is possible to reduce the excessive complexity caused by pure transformer models.

The Mamba model is a novel state space model. Compared with the quadratic computational complexity of the transformer series, Mamba has linear computational complexity. It achieves efficient feature representation through selective information modeling [126]. In the field of computer vision, Zhu et al. proposed a Mamba model for image processing, namely Vision Mamba (ViM) [127]. In the ViM model, the image is first divided into patches and projected into tokens, which are then fed into the ViM encoder. Its bidirectional module is used to enhance its ability to understand the context of an image. Future research should consider applying the Mamba model to metal surface defect detection, or exploring hybrid architectures that combine Mamba with other models.
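The gap between quadratic and linear complexity becomes visible once the number of tokens is counted for typical image sizes. The sketch below assumes a patch size of 16 and compares only the growth of the dominant term (token pairs for global attention versus tokens for a linear-time scan); constant factors and channel dimensions are ignored, so the figures are indicative rather than actual operation counts.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patches (tokens) for an image; patch size assumed."""
    return (height // patch) * (width // patch)

for side in (224, 1024):
    n = num_tokens(side, side)
    print(side, n, n * n)   # image side, tokens (linear term), token pairs (quadratic term)
# 224 196 38416
# 1024 4096 16777216
```

Moving from a 224 × 224 input to a 1024 × 1024 industrial image multiplies the token count by about 21 but the number of attention pairs by about 437, which is why linear-complexity models are of interest for high-resolution inspection.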

7.5. Model Interpretability

In the field of metal surface defect detection, the application of deep learning has improved detection accuracy and efficiency. However, it also introduces challenges related to model interpretability. Deep learning models are often regarded as black-box systems [128,129]. Although they can provide detection results, their internal decision process is difficult to understand directly. In metal surface defect detection, a model may identify a region as defective but cannot clearly explain the specific basis of its judgment. This uncertainty may introduce risks in industrial applications. Misjudgment can affect a production process. Therefore, in practical deployment, it is necessary not only to focus on detection accuracy but also to enhance the interpretability of model decisions to achieve a clearer and explainable detection process.

To improve interpretability, attention visualization and feature response analysis can be applied during detection. These methods present the regions emphasized by the model in a visual form. For example, heat maps can be generated to observe whether the model mainly focuses on defect regions or is influenced by background interference. Such visualization helps engineers evaluate whether the model’s decisions are reasonable. It can also support the further optimization of training strategies and improve model stability. Future research may explore simpler inspection tools to increase the transparency and reliability of deep learning models.

7.6. Domain Shift and Cross-Scenario Generalization

During the collection of metal surface defects, variations in production lines, lighting conditions, and material types may cause changes in the distribution of defect images [115]. When a model is trained on one dataset and applied to a different environment, its detection performance may decline significantly. At present, many studies focus on validation using only a limited number of datasets. Some datasets, such as the Tianchi aluminum surface defect dataset and X-SDD, are used less frequently. This concentration on specific datasets may lead to over-optimization within a limited data distribution and a lack of cross-dataset and multi-scenario validation. When such models are deployed in new metal surface defect detection tasks, their performance may be affected.

Future research should strengthen cross-domain generalization from several aspects. First, model evaluation should not rely on a single dataset. Comprehensive validation across multiple datasets is necessary. Second, domain adaptation methods can be explored to reduce distribution differences among datasets and improve model performance in unseen environments [130]. In addition, datasets that include diverse types of metal surface defects need to be constructed and used in model training so as to improve model adaptability. Multi-scenario training and the validation of deep learning models should be conducted to enhance their cross-domain robustness.

8. Conclusions

Accurate and fast metal surface defect detection is required to achieve the highly desired quality and structural safety for numerous industrial products. It is also a key technology for promoting intelligent manufacturing and automated quality control. With the rapid development of industrial automation, traditional manual inspection and conventional machine learning methods can no longer meet the requirements for high accuracy and real-time performance in complex scenarios. In recent years, the introduction of deep learning into this area has greatly advanced the metal surface defect detection technology.

This review has systematically analyzed the current applications of mainstream deep learning frameworks to metal surface defect detection. These frameworks include CNN-based models, such as the YOLO series and R-CNN, as well as detection models based on the transformer architecture. The YOLO models, which follow a one-stage detection strategy, show clear advantages in real-time detection and industrial deployment. They can achieve efficient inference in resource-limited environments. The R-CNN series, which adopts a two-stage strategy, provides high localization accuracy under complex background conditions. Transformer-based models enhance feature representation by modeling global dependencies. They show strong potential in coping with complex defect scenarios. In addition, the Feature Pyramid Network, which enables multi-scale feature fusion, plays a key role in handling defects of different sizes. It is an important structural component for improving the detection performance of models. This paper has also introduced several related metal surface defect datasets. These datasets provide important benchmarks for model validation and performance comparison.

Although deep learning methods have steadily improved the accuracy of metal surface defect detection in recent years, several challenges remain open in practical industrial use. Metal surface defects often vary greatly in size and have irregular shapes, which makes multi-scale modeling and small-defect recognition difficult. The trade-off between accuracy and real-time performance among different models, as well as the conflict between model complexity and computational efficiency, limits their practical deployment. In addition, data scarcity, class imbalance, and domain shift place higher demands on model generalization. Limited interpretability also makes model decisions less transparent and harder for engineers to understand. Future research should focus on efficient multi-scale modeling, lightweight architecture design, and improved cross-domain robustness. It is important to enhance interpretability and engineering adaptability while maintaining a fine balance between accuracy and efficiency.
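One simple way to reason about the accuracy versus real-time trade-off is to keep only the detectors that are not dominated on both axes, i.e., the Pareto front over (accuracy, speed). The sketch below is an illustrative assumption rather than a procedure taken from the reviewed works; the detector names and the mAP and FPS values are hypothetical placeholders, not benchmark results.

```python
def pareto_front(models):
    """Return names of models not dominated in (accuracy, speed).

    models: list of (name, map_score, fps) tuples. A model is dominated if
    some other model is at least as good on both metrics and strictly
    better on at least one of them.
    """
    front = []
    for name, acc, fps in models:
        dominated = any(
            (a >= acc and f >= fps) and (a > acc or f > fps)
            for n, a, f in models
            if n != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only; these are NOT measured results.
candidates = [
    ("detector_A", 0.78, 120),  # fastest
    ("detector_B", 0.82, 45),   # most accurate at moderate speed
    ("detector_C", 0.75, 60),   # slower and less accurate than detector_A
    ("detector_D", 0.82, 30),   # same accuracy as detector_B but slower
]
front = pareto_front(candidates)
print(front)  # ['detector_A', 'detector_B']
```

Models on the front represent different compromises; which one to deploy still depends on the throughput required by the production line.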

In general, metal surface defect detection is shifting from optimizing the performance of a single model to coordinated optimization in multiple aspects. Through innovations in algorithm design, improvements in multi-scale fusion techniques, and optimization of deployment strategies, it is possible to achieve a balance among detection accuracy, real-time performance, and generalization ability. This progress should provide strong support for reliable quality control in intelligent manufacturing and 3D printing environments [131,132,133].

Author Contributions

Conceptualization, Y.W., M.Z. and C.Z.; methodology, Y.W. and M.Z.; validation, Y.W., M.Z. and C.Z.; formal analysis, Y.W. and C.Z.; investigation, Y.W. and C.Z.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, M.Z. and C.Z.; supervision, M.Z. and K.S.; project administration, M.Z. and K.S.; funding acquisition, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by FDCT (Fundo para o Desenvolvimento das Ciencias e da Tecnologia) under Grant No. 0147/2024/AFJ, in part by the National Natural Science Foundation of China under Grant No. 62461160259, and in part by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. RG-9-135-43.

Data Availability Statement

This review focuses on metal surface defect detection. The datasets it examines are those described in Section 4, Metal Surface Defect Dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

YOLO	You Only Look Once
PN	Pyramid Network
CNN	Convolutional Neural Network
CBL	Convolutional Block Layer
BN	Batch Normalization
CSP1_X	Cross Stage Partial 1_X
CSP2_X	Cross Stage Partial 2_X
RPN	Region Proposal Network
RoI	Region of Interest
ViT	Vision Transformer
CV	Computer Vision
Q	Query
K	Key
V	Value
SwinT	Swin Transformer
DETR	Detection Transformer
FFNs	Feedforward Neural Networks
FPN	Feature Pyramid Network
P	Precision
R	Recall
AP	Average Precision
mAP	Mean Average Precision
FPS	Frames Per Second
ViM	Vision Mamba

References

Zhang, Y.; Yan, W. Applications of machine learning in metal powder-bed fusion in-process monitoring and control: Status and challenges. J. Intell. Manuf. 2023, 34, 2557–2580.
Kumar, A.; Harsha, S. A systematic literature review of defect detection in railways using machine vision-based inspection methods. Int. J. Transp. Sci. Technol. 2024, 18, 207–226.
Usamentiaga, R.; Lema, D.G.; Pedrayes, O.D.; Garcia, D.F. Automated Surface Defect Detection in Metals: A Comparative Review of Object Detection and Semantic Segmentation Using Deep Learning. IEEE Trans. Ind. Appl. 2022, 58, 4203–4213.
Andrianandrianina Johanesa, T.V.; Equeter, L.; Mahmoudi, S.A. Survey on AI Applications for Product Quality Control and Predictive Maintenance in Industry 4.0. Electronics 2024, 13, 976.
Qiao, Q.; Hu, H.; Ahmad, A.; Wang, K. A Review of Metal Surface Defect Detection Technologies in Industrial Applications. IEEE Access 2025, 13, 48380–48400.
Yeung, C.C.; Lam, K.M. Efficient Fused-Attention Model for Steel Surface Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 2510011.
Xu, R.; Hao, R.; Huang, B. Efficient surface defect detection using self-supervised learning strategy and segmentation network. Adv. Eng. Inform. 2022, 52, 101566.
Demir, K.; Ay, M.; Cavas, M.; Demir, F. Automated steel surface defect detection and classification using a new deep learning-based approach. Neural Comput. Appl. 2023, 35, 8389–8406.
Wang, Y.; Lu, Y.; Zhu, F.; Du, G.; Li, Z. High precision classification of hot rolled strip steel surface defects using dual path features and entropy attention fusion. Sci. Rep. 2026, 16, 5351.
Rangdale, S.; Pathak, P.; Potdar, P.; Takawale, P.; Unde, A. A Research Survey on Intelligent Detection Surface Irregularities with Deep Learning. In Proceedings of the 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS), Bengaluru, India, 7–9 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
Ibrahim, A.A.M.; Tapamo, J.R. A Survey of Vision-Based Methods for Surface Defects’ Detection and Classification in Steel Products. Informatics 2024, 11, 25.
Ashrafi, S.; Teymouri, S.; Etaati, S.; Khoramdel, J.; Borhani, Y.; Najafi, E. Steel surface defect detection and segmentation using deep neural networks. Results Eng. 2025, 25, 103972.
Kong, L.; Duan, J.; Chen, J.; Zhang, Y.; Zhao, L.; Yu, J. Edge attention-based transformer for metal surface defect segmentation. J. Supercomput. 2026, 82, 110.
Khanam, R.; Hussain, M.; Hill, R.; Allen, P. A Comprehensive Review of Convolutional Neural Networks for Defect Detection in Industrial Applications. IEEE Access 2024, 12, 94250–94295.
Fu, Y.; Downey, A.R.; Yuan, L.; Zhang, T.; Pratt, A.; Balogun, Y. Machine learning algorithms for defect detection in metal laser-based additive manufacturing: A review. J. Manuf. Process. 2022, 75, 693–710.
Zorić, B.; Matić, T.; Hocenski, Ž. Classification of biscuit tiles for defect detection using Fourier transform features. ISA Trans. 2022, 125, 400–414.
Ni, X.; Liu, H.; Ma, Z.; Wang, C.; Liu, J. Detection for Rail Surface Defects via Partitioned Edge Feature. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5806–5822.
Gyimah, N.K.; Girma, A.; Mahmoud, M.N.; Nateghi, S.; Homaifar, A.; Opoku, D. A Robust Completed Local Binary Pattern (RCLBP) for Surface Defect Detection. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, VIC, Australia, 17–20 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1927–1934.
Chen, S.; Zhu, X.; Wang, J.; Chen, X.; Yu, F.; Li, S.; Ye, S. Dgfnet: Deformable attention and flow-guided feature learning for fine-grained surface defect segmentation. J. Intell. Manuf. 2025, 1–17.
Chen, C.; Lee, H.; Chen, M. Steel surface defect detection method based on improved YOLOv9. Sci. Rep. 2025, 15, 25098.
Liu, W.; Yuan, W. Supervised Focused Feature Network for Steel Strip Surface Defect Detection. Mathematics 2025, 13, 3285.
Duan, Y.; He, L.; Wang, Z.; Sa, J.; Yang, J.; Chen, X.; Shi, B.; Zhang, Y.; Sun, J. Multiscale diffusion-enhanced attention network for steel surface defect detection in Polysilicon Production. Sci. Rep. 2026, 16, 5307.
Wei, J.; Chen, X.; Liu, Y.; Feng, X. BEDUNet: Efficient Weakly-Supervised Segmentation of Hot-Rolled Steel Defects with Boundary-Enhanced Dynamic Learning. Expert Syst. Appl. 2025, 306, 130939.
Dong, X.; Li, Y.; Fu, L.; Liu, J. Edge-aware interactive refinement network for strip steel surface defects detection. Meas. Sci. Technol. 2025, 36, 016222.
Ameri, R.; Hsu, C.C.; Band, S.S. A systematic review of deep learning approaches for surface defect detection in industrial applications. Eng. Appl. Artif. Intell. 2024, 130, 107717.
Zhang, H.; Fu, W.; Wang, X.; Li, D.; Zhu, D.; Su, X. An efficient model for metal surface defect detection based on attention mechanism and multi-scale feature. J. Supercomput. 2025, 81, 40.
Zhao, B.; Chen, Y.; Jia, X.; Ma, T. Steel surface defect detection algorithm in complex background scenarios. Measurement 2024, 237, 115189.
Li, M.; Wang, H.; Wan, Z. Surface defect detection of steel strips based on improved YOLOv4. Comput. Electr. Eng. 2022, 102, 108208.
Lu, J.; Yu, M.; Liu, J. Lightweight strip steel defect detection algorithm based on improved YOLOv7. Sci. Rep. 2024, 14, 13267.
Sui, T.; Wang, J. DMPDD-Net: An Effective Defect Detection Method for Aluminum Profiles Surface Defect. IEEE Trans. Instrum. Meas. 2024, 74, 3500313.
Duan, H.; Huang, J.; Liu, W.; Shu, F. Defective surface detection based on improved faster R-CNN. In Proceedings of the 2022 IEEE International Conference on Industrial Technology (ICIT), Shanghai, China, 28–31 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
Akhyar, F.; Liu, Y.; Hsu, C.Y.; Shih, T.K.; Lin, C.Y. FDD: A deep learning–based steel defect detectors. Int. J. Adv. Manuf. Technol. 2023, 126, 1093–1107.
Evstafev, O.; Shavetov, S. Preprocessing Digital Images for Enhanced Detection and Classification of Surface Defects in Cold-Rolled Sheet Metal. In Proceedings of the 2024 10th International Conference on Control, Decision and Information Technologies (CoDIT), Valletta, Malta, 1–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1530–1535.
Yang, F.; Huo, J.; Cheng, Z.; Chen, H.; Shi, Y. An improved mask R-CNN micro-crack detection model for the surface of metal structural parts. Sensors 2023, 24, 62.
Komijani, A.; Vafaeinezhad, F.; Khoramdel, J.; Borhani, Y.; Najafi, E. Multi-label Classification of Steel Surface Defects Using Transfer Learning and Vision Transformer. In Proceedings of the 2022 13th International Conference on Information and Knowledge Technology (IKT), Karaj, Iran, 20–22 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
Cheng, D.J.; Wang, S.; Zhang, H.B.; Sun, Z.Y. A novel framework for low-contrast and random multi-scale blade casting defect detection by an adaptive global dynamic detection transformer. Comput. Ind. 2024, 162, 104138.
Ma, Y.; Yin, J.; Huang, F.; Li, Q. Surface defect inspection of industrial products with object detection deep networks: A systematic review. Artif. Intell. Rev. 2024, 57, 333.
Abbes, W.; Elleuch, J.F.; Sellami, D. Defect-Net: A new CNN model for steel surface defect classification. In Proceedings of the 2024 IEEE 12th International Symposium on Signal, Image, Video and Communications (ISIVC), Marrakech, Morocco, 21–23 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5.
Ahmad, H.M.; Rahimi, A. Deep learning methods for object detection in smart manufacturing: A survey. J. Manuf. Syst. 2022, 64, 181–196.
Tang, B.; Chen, L.; Sun, W.; Lin, Z.K. Review of surface defect detection of steel products based on machine vision. IET Image Process. 2023, 17, 303–322.
Mordia, R.; Verma, A.K. Visual techniques for defects detection in steel products: A comparative study. Eng. Fail. Anal. 2022, 134, 106047.
Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
Sirisha, U.; Praveen, S.P.; Srinivasu, P.N.; Barsocchi, P.; Bhoi, A.K. Statistical analysis of design aspects of various YOLO-based deep learning models for object detection. Int. J. Comput. Intell. Syst. 2023, 16, 126.
Xu, H.; Han, F.; Zhou, W.; Liu, Y.; Ding, F.; Zhu, J. ESMNet: An enhanced YOLOv7-based approach to detect surface defects in precision metal workpieces. Measurement 2024, 235, 114970.
Wang, J.; Meng, R.; Huang, Y.; Zhou, L.; Huo, L.; Qiao, Z.; Niu, C. Road defect detection based on improved YOLOv8s model. Sci. Rep. 2024, 14, 16758.
Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023, 11, 677.
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733.
Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement 2023, 214, 112776.
Huang, B.; Ding, Y.; Liu, G.; Tian, G.; Wang, S. ASD-YOLO: An aircraft surface defects detection method using deformable convolution and attention mechanism. Measurement 2024, 238, 115300.
Fan, J.; Wang, M.; Li, B.; Liu, M.; Shen, D. ACD-YOLO: Improved YOLOv5-based method for steel surface defects detection. IET Image Process. 2024, 18, 761–771.
Xie, Y.; Hu, W.; Xie, S.; He, L. Surface defect detection algorithm based on feature-enhanced YOLO. Cogn. Comput. 2023, 15, 565–579.
Gao, S.; Chu, M.; Zhang, L. A detection network for small defects of steel surface based on YOLOv7. Digit. Signal Process. 2024, 149, 104484.
Selamet, F.; Cakar, S.; Kotan, M. Automatic detection and classification of defective areas on metal parts by using adaptive fusion of faster R-CNN and shape from shading. IEEE Access 2022, 10, 126030–126038.
Shen, M.; Cai, F.; Xie, J. A Multi-Scale Defect Detection for Steel Surface Based on Improved Faster R-CNN. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7448–7453.
Ye, X.; Ye, J.; He, Z.; Zhang, D.; Hu, X.; Chen, Q. A Novel Defect Detection Method for Ferrite Shield Surface Defects by Improved Faster R-CNN. Sci. Program. 2022, 2022, 5695243.
Wang, H.; Li, M.; Wan, Z. Rail surface defect detection based on improved Mask R-CNN. Comput. Electr. Eng. 2022, 102, 108269.
Song, C.; Chen, J.; Lu, Z.; Li, F.; Liu, Y. Steel Surface Defect Detection via Deformable Convolution and Background Suppression. IEEE Trans. Instrum. Meas. 2023, 72, 5017709.
Ye, Q.; Dong, Y.; Zhang, X.; Zhang, D.; Wang, S. Robustness defect detection: Improving the performance of surface defect detection in interference environment. Opt. Lasers Eng. 2024, 175, 108035.
Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the Real World: A Survey on NLP Applications. Information 2023, 14, 242.
Rahali, A.; Akhloufi, M.A. End-to-end transformer-based models in textual-based NLP. AI 2023, 4, 54–110.
Islam, S.; Elmekki, H.; Elsebai, A.; Bentahar, J.; Drawel, N.; Rjoub, G.; Pedrycz, W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst. Appl. 2024, 241, 122666.
Ameri, R.; Hsu, C.C. LACTNet: A Label-Aware CNN-transformer network for surface defect segmentation. Expert Syst. Appl. 2026, 310, 131314.
Wu, S.; Yang, H.; Liao, L.; Song, C.; Fang, Y.; Fu, J.; Li, T. DSAT: A dynamic sparse attention transformer for steel surface defect detection with hierarchical feature fusion. Sci. Rep. 2025, 15, 29198.
Ling, J.; Tan, C.; Cui, L. Metal Surface Defect Detection Based on Transformer Merging Edge Information. In Proceedings of the 2025 11th International Symposium on System Security, Safety, and Reliability (ISSSR), Anshun, China, 12–13 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 96–100.
Vasan, V.; Sridharan, N.V.; Vaithiyanathan, S.; Aghaei, M. Detection and classification of surface defects on hot-rolled steel using vision transformers. Heliyon 2024, 10, e38498.
Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. AI Open 2022, 3, 111–132.
Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sens. 2022, 14, 592.
Li, Y.; Xiang, Y.; Guo, H.; Liu, P.; Liu, C. Swin Transformer Combined with Convolution Neural Network for Surface Defect Detection. Machines 2022, 10, 1083.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
Wei, H.; Zhao, L.; Li, R.; Zhang, M. RFAConv-CBM-ViT: Enhanced vision transformer for metal surface defect detection. J. Supercomput. 2025, 81, 155.
Jeong, M.; Yang, M.; Jeong, J. Hybrid-DC: A Hybrid Framework Using ResNet-50 and Vision Transformer for Steel Surface Defect Classification in the Rolling Process. Electronics 2024, 13, 4467.
Zhu, W.; Zhang, H.; Zhang, C.; Zhu, X.; Guan, Z.; Jia, J. Surface defect detection and classification of steel using an efficient Swin Transformer. Adv. Eng. Inform. 2023, 57, 102061.
Liu, H.; Chen, C.; Hu, R.; Bin, J.; Dong, H.; Liu, Z. CGTD-Net: Channel-wise global Transformer based dual-branch network for industrial strip steel surface defect detection. IEEE Sens. J. 2024, 24, 4863–4873.
Huang, X.; Zhu, J.; Huo, Y. SSA-YOLO: An improved YOLO for hot-rolled strip steel surface defect detection. IEEE Trans. Instrum. Meas. 2024, 73, 5040017.
Xia, Z.; Zhao, Y.; Gu, J.; Wang, W.; Zhang, W.; Huang, Z. FC-DETR: High-precision end-to-end surface defect detector based on foreground supervision and cascade refined hybrid matching. Expert Syst. Appl. 2024, 266, 126142.
Mao, H.; Gong, Y. Steel surface defect detection based on the lightweight improved RT-DETR algorithm. J. Real-Time Image Process. 2025, 22, 28.
Jiang, X.; Li, Y.; Jiang, T.; Xie, J.; Wu, Y.; Cai, Q.; Jiang, J.; Xu, J.; Zhang, H. RoadFormer: Pyramidal deformable vision transformers for road network extraction with remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 102987.
Xie, J.; Pang, Y.; Nie, J.; Cao, J.; Han, J. Latent Feature Pyramid Network for Object Detection. IEEE Trans. Multimed. 2022, 25, 2153–2163.
Chan, S.; Li, S.; Zhang, H.; Zhou, X.; Mao, J.; Hong, F. Feature optimization-guided high-precision and real-time metal surface defect detection network. Sci. Rep. 2024, 14, 31941.
Zhang, W.; Geng, X. EMSD-YOLO: A Surface Defect Detection Model Based on Feature Extraction and Fusion. Cogn. Robot. 2026, 6, 92–105.
Fu, Y.; Wang, Y.; Cheng, Z.; Li, Y. Steel surface defect detection based on lightweight network with attention feature fusion and multi-scale detection head. J. Ambient. Intell. Humaniz. Comput. 2026, 17, 401–412.
Zhu, X.; Yang, X.; Wang, Z.; Li, H.; Dou, W.; Ge, J.; Lu, L.; Qiao, Y.; Dai, J. Parameter-Inverted Image Pyramid Networks. arXiv 2024, arXiv:2406.04330.
Han, L.; Li, N.; Li, J.; Gao, B.; Niu, D. SA-FPN: Scale-aware attention-guided feature pyramid network for small object detection on surface defect detection of steel strips. Measurement 2025, 249, 117019.
Du, Y.; Chen, H.; Fu, Y.; Zhu, J.; Zeng, H. AFF-Net: A strip steel surface defect detection network via adaptive focusing features. IEEE Trans. Instrum. Meas. 2024, 73, 2518514.
Zhou, H.; Yang, R.; Hu, R.; Shu, C.; Tang, X.; Li, X. ETDNet: Efficient Transformer-Based Detection Network for Surface Defect Detection. IEEE Trans. Instrum. Meas. 2023, 72, 2525014.
Zhu, J.; Pang, Q.; Li, S.; Tian, S.; Li, J.; Li, Y. ADDet: An efficient multiscale perceptual enhancement network for aluminum defect detection. IEEE Trans. Instrum. Meas. 2023, 73, 5004714.
Song, K.; Yan, Y. NEU Surface Defect Database. 2014. Available online: http://faculty.neu.edu.cn/songkechen/zh_CN/zdylm/263270/list/ (accessed on 5 November 2024).
Lv, X.; Duan, F.; Jiang, J.J.; Fu, X.; Gan, L. Deep Metallic Surface Defect Detection: The New Benchmark and Detection Network. Sensors 2020, 20, 1562.
Tianchi Aluminum Surface Defects. 2018. Available online: https://tianchi.aliyun.com/competition/entrance/231682/information (accessed on 17 March 2026).
Feng, X.; Gao, X.; Luo, L. X-SDD: A new benchmark for hot rolled steel strip surface defects detection. Symmetry 2021, 13, 706.
Li, G.; Shao, R.; Wan, H.; Zhou, M.; Li, M. A model for surface defect detection of industrial products based on attention augmentation. Comput. Intell. Neurosci. 2022, 2022, 9577096.
Qian, K.; Zou, L.; Wang, Z.; Wang, W. Metallic surface defect recognition network based on global feature aggregation and dual context decoupled head. Appl. Soft Comput. 2024, 158, 111589.
Li, H.; Li, X.; Fan, Q.; Xiong, Q.; Wang, X.; Leung, V.C. Transfer Learning for Real-Time Surface Defect Detection with Multi-Access Edge-Cloud Computing Networks. IEEE Trans. Netw. Serv. Manag. 2023, 21, 310–323.
Xie, W.; Ma, W.; Sun, X. An efficient re-parameterization feature pyramid network on YOLOv8 to the detection of steel surface defect. Neurocomputing 2025, 614, 128775.
Cheng, Z.; Gao, L.; Wang, Y.; Deng, Z.; Tao, Y. EC-YOLO: Effectual Detection Model for Steel Strip Surface Defects Based on YOLO-V5. IEEE Access 2024, 12, 62765–62778.
Zhuang, W.; Zhang, T.; Yao, L.; Lu, Y.; Yuan, P. A Research on Image Semantic Refinement Recognition of Product Surface Defects Based on Causal Knowledge. Appl. Sci. 2022, 12, 8828.
Lv, S.; Hou, Z.; Li, B.; Ni, H.; Shi, W.; Tao, C.; Zhou, L.; Gu, H.; Chen, L. Accurate location detection method for aluminum profile surface defects based on improved YOLOX-S Algorithm. Met. Mater. Int. 2025, 31, 523–536.
Guo, Y.; Wei, J.; Feng, X. TSEDNet: Task-specific encoder–decoder network for surface defects of strip steel. Measurement 2025, 239, 115438.
Wen, L.; Zhang, Y.; Gao, L.; Li, X.; Li, M. A New Multi-Scale Multi-Attention Convolutional Neural Network for Fine-Grained Surface Defect Detection. IEEE Trans. Instrum. Meas. 2023, 72, 5013811.
Xu, Y.; Wang, H.; Liu, Z.; Zuo, M. Self-Supervised Defect Representation Learning for Label-Limited Rail Surface Defect Detection. IEEE Sens. J. 2023, 23, 29235–29246.
Zhang, Z.; Zhang, L. Unsupervised pixel-level detection of rail surface defects using multistep domain adaptation. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 5784–5795.
Liu, Y.; Zhang, C.; Dong, X. A survey of real-time surface defect inspection methods based on deep learning. Artif. Intell. Rev. 2023, 56, 12131–12170.
Chen, H.; Du, Y.; Fu, Y.; Zhu, J.; Zeng, H. DCAM-Net: A rapid detection network for strip steel surface defects based on deformable convolution and attention mechanism. IEEE Trans. Instrum. Meas. 2023, 72, 5005312.
Liu, H.; Hu, R.; Dong, H.; Liu, Z. SFC-YOLOv8: Enhanced Strip Steel Surface Defect Detection Using Spatial-Frequency Domain-Optimized YOLOv8. IEEE Trans. Instrum. Meas. 2025, 74, 9700111.
Zhou, C.; Lu, Z.; Lv, Z.; Meng, M.; Tan, Y.; Xia, K.; Liu, K.; Zuo, H. Metal surface defect detection based on improved YOLOv5. Sci. Rep. 2023, 13, 20803.
Guo, B.; Wang, Y.; Zhen, S.; Yu, R.; Su, Z. SPEED: Semantic prior and extremely efficient dilated convolution network for real-time metal surface defects detection. IEEE Trans. Ind. Inform. 2023, 19, 11380–11390.
Li, Y.; Han, Z.; Wang, W.; Xu, H.; Wei, Y.; Zai, G. Steel surface defect detection based on sparse global attention transformer. Pattern Anal. Appl. 2024, 27, 152.
Yang, K.; Chen, T. Lightweight Surface Defect Detection Algorithm Based on Improved YOLOv5. In Proceedings of the 2024 5th International Conference on Mechatronics Technology and Intelligent Manufacturing (ICMTIM), Nanjing, China, 26–28 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 798–802.
Cao, Z.; Li, J.; Shao, S.; Zhang, D.; Zhou, M. Siamese Adaptive Network-Based Accurate and Robust Visual Object Tracking Algorithm for Quadrupedal Robots. IEEE Trans. Cybern. 2025, 55, 1264–1276.
Yu, T.; Luo, X.; Li, Q.; Li, L. CRGF-YOLO: An optimized multi-scale feature fusion model based on YOLOv5 for detection of steel surface defects. Int. J. Comput. Intell. Syst. 2024, 17, 154.
Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-time steel surface defect detection with improved multi-scale YOLO-v5. Processes 2023, 11, 1357.
Huang, Z.; Yang, S.; Zhou, M.; Li, Z.; Gong, Z.; Chen, Y. Feature map distillation of thin nets for low-resolution object recognition. IEEE Trans. Image Process. 2022, 31, 1364–1379.
Yang, S.; Yang, J.; Zhou, M.; Huang, Z.; Zheng, W.S.; Yang, X.; Ren, J. Learning From Human Educational Wisdom: A Student-Centered Knowledge Distillation Method. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4188–4205.
Niu, J.; Lin, C.; Cao, Z.; Zhou, M. Learning to Detect Objects Under Inclement Weather Conditions via Symmetric Localization Distillation and Adaptive Label Assignment. IEEE Trans. Ind. Inform. 2026. early access.
Wang, G.; Qiao, J.; Bi, J.; Li, W.; Zhou, M. TL-GDBN: Growing Deep Belief Network With Transfer Learning. IEEE Trans. Autom. Sci. Eng. 2019, 16, 874–885.
Yao, S.; Kang, Q.; Zhou, M.; Rawa, M.J.; Albeshri, A. Discriminative Manifold Distribution Alignment for Domain Adaptation. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 1183–1197.
Chen, J.; Xu, Q.; Kang, Q.; Zhou, M. MOGAN: Morphologic-Structure-Aware Generative Learning from a Single Image. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 2021–2033.
Zhang, L.; Han, H.; Zhou, M.; Al-Turki, Y.; Abusorrah, A. An Improved Discriminative Model Prediction Approach to Real-Time Tracking of Objects with Camera as Sensors. IEEE Sens. J. 2021, 21, 17308–17317.
Wang, X.; Li, Y.; Gong, Y.; Liu, M.; Luo, K. DS-IIEG: A dual-stage industrial image enhancement and generation framework for robust surface defect detection in manufacturing with limited data. J. Intell. Manuf. 2026, 1–32.
Chen, J.; Cheng, Y.; Wen, G.; Liu, X. BMMC-GAN: A Bi-directional Mapping and Multi-Category Controlled Generative Adversarial Network for Diverse Industrial Defect Images Generation. Neurocomputing 2026, 672, 132859.
Cai, G.; Wang, Y.; He, L.; Zhou, M. Unsupervised Domain Adaptation with Adversarial Residual Transform Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3073–3086.
Liu, H.; Zhang, Q.; Hu, Y.; Zeng, H.; Fan, B. Unsupervised Multi-Expert Learning Model for Underwater Image Enhancement. IEEE/CAA J. Autom. Sin. 2024, 11, 708–722.
Assad, S.; Isa, N.A.M.; Saleh, S.A.M. Hybrid CNN-Transformer Models for Industrial Defect Detection: A Systematic Review. Results Eng. 2026, 29, 109457.
Pan, W.; Zhong, R.; Huang, J.; Li, Y.; Zhang, W.; Liu, T.; Liu, Y. DEENet: An edge-enhanced CNN–Transformer dual-encoder model for steel surface defect detection. Sci. Rep. 2026, 16, 6692.
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
Ghahramani, M.; Zhou, M. Decoding the Black Box: Shedding Light on Manufacturing Processes with Explainable AI. In Proceedings of the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria, 5–8 October 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 50–55.
Ghahramani, M.; Zhou, M. CLAIRE: Compressed Latent Autoencoder for Industrial Representation and Evaluation—A Deep Learning Framework for Smart Manufacturing. IEEE Trans. Syst. Man Cybern. Syst. 2026. early access. [Google Scholar] [CrossRef]
Kang, Q.; Yao, S.; Zhou, M.; Zhang, K.; Abusorrah, A. Effective Visual Domain Adaptation via Generative Adversarial Distribution Matching. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3919–3929.
Fang, Q.; Xiong, G.; Zhou, M.; Tamir, T.; Yan, C.; Wu, H.; Shen, Z.; Wang, F. Process Monitoring, Diagnosis and Control of Additive Manufacturing. IEEE Trans. Autom. Sci. Eng. 2024, 21, 1041–1067.
Fang, Q.; Xiong, G.; Zhao, M.; Tamir, T.; Shen, Z.; Yan, C.; Wang, F. Probabilistic Data-Driven Modeling of a Melt Pool in Laser Powder Bed Fusion Additive Manufacturing. IEEE Trans. Autom. Sci. Eng. 2025, 22, 4908–4925.
Ghahramani, M.; Qiao, Y.; Wu, N.; Zhou, M. A Multiobjective Optimization Approach for Feature Selection in Gentelligent Systems. IEEE Internet Things J. 2026, 13, 2451–2461.

Figure 1. Evolution of metal surface defect detection frameworks.

Figure 2. Four modules of a YOLO network framework.

Figure 3. A Faster R-CNN network model.

Figure 4. Schematic diagram of vision-transformer network model.

Figure 5. Transformer encoder.

Figure 6. The structure of DETR.

Figure 7. Feature Pyramid Network.

Figure 8. NEU-DET dataset.

Figure 9. GC10-DET dataset.

Figure 10. Tianchi Aluminum Profile surface defect dataset.

Figure 11. X-SDD dataset.

Figure 12. RSDDs dataset.

Table 1. Summary of YOLO-based models for defect detection.

Model Name	Defect Type	Improvements	Advantages	Disadvantages
RDD-YOLO [49]	Cracks, inclusions, scratches, and other similar defects	Feature extraction and reuse	Enhanced feature extraction capability	Weak tiny-defect identification
ASD-YOLO [50]	Cracks, dents, paint peeling, scratches, and other similar defects	Multi-scale information capture	Sample imbalance mitigation	Large model size, high computational complexity
ACD-YOLO [51]	Cracks, inclusions, scratches, and other similar defects	Anchor boundary optimization, feature extraction optimization	Speed–accuracy balance, improved small-target detection	Limited generalization ability
FE-YOLO [52]	Steel surface defects, PCB defects	Improved regression loss, K-means++ anchor optimization	Lightweight, fast inference	Parameters need further optimization, requires model distillation
SRN-YOLO [53]	Crazing, punching, weld line, and other similar defects	Backbone output reuse via FPN	Small and blurry defect detection	Weak recognition of unclear boundaries

Table 2. Summary of R-CNN-based models for defect detection.

Model Name	Defect Type	Improvements	Advantages	Disadvantages
Improved Faster R-CNN with PANet [55]	Cracks, inclusions, scratches, and other similar defects	Enhanced anchor box localization capability	Multi-scale target adaptability, complex background adaptability	High computational cost
Improved Faster R-CNN [56]	Cracks, pits, impurities, dirt	Enhanced multi-scale adaptation of anchor boxes	Adaptability to complex backgrounds and diverse defect shapes	Confusion between similarly shaped defects
Improved Mask R-CNN [57]	Rail surface defects	Adoption of CIoU as a new metric for candidate box selection	Better candidate box filtering	Slow detection speed
Improved Faster R-CNN with SF [58]	Cracks, inclusions, scratches, and other similar defects	Complex background suppression to enhance edge information extraction	Enhanced multi-scale adaptation of anchor boxes	Weak recognition of sparsely distributed defects
A Robust Faster R-CNN [59]	Fracture defects, wear defects, soiling, and other similar defects	Improved network feature fusion and adaptation to defect size	Excellent recognition of blurred and distorted defects	Handling of dynamic interference needs improvement

Table 3. Summary of transformer-based models for defect detection.

Model Name	Defect Type	Improvements	Advantages	Disadvantages
RFAConv-CBM-ViT [71]	Aluminum surface defects, hot-rolled steel strip surface defects	Optimization of attention map density, enhancement of feature extraction and context capture capabilities	Excellent performance in handling highly variable and complex defect types	Generalization ability needs further improvement
Hybrid-DC [72]	Cracks, inclusions, scratches, and other similar defects	Feature extraction optimization via residual network integration	High training efficiency	Large model parameters and high computational cost
LSwin Transformer [73]	Cracks, inclusions, scratches, and other similar defects	Model enhancement driven by a novel window shift strategy	Strong generalization ability	Slow detection speed
CGTD-Net [74]	Cracks, inclusions, scratches, and other similar defects	Augmentation of network semantic learning capabilities, enhanced attention to spatial and channel information	Strong detection ability for narrow defects	Dependence on hyperparameter selection needs to be reduced
SSA-YOLO (with Swin Transformer) [75]	Cracks, inclusions, scratches, and other similar defects	Improvement of global and local information extraction capabilities	High detection accuracy while maintaining fast detection speed	Weak performance in detecting small defects and small sample defect types
FC-DETR [76]	Steel defects, PCB defects	Optimization of foreground defect feature extraction, improvement of positive sample matching accuracy	Detection redundancy avoidance	Weak recognition ability for defects similar to the background
LRT-DETR [77]	Cracks, inclusions, scratches, and other similar defects	Lightweight processing of the network model	High degree of lightweight design, fast detection speed	Poor detection of small objects and irregular defects

Table 4. Summary of FPN-based models for defect detection.

Model Name	Defect Type	Improvements	Advantages
SA-FPN [84]	Punching hole, crescent gap, oil spot, and other defects	Pyramid network layer semantic difference reduction	Small surface defect recognition enhancement
Foc-FPN [85]	Scratches, rolled-in scale, inclusion, and other similar defects	Multi-scale feature aggregation optimization	Complex defect adaptability improvement
CM-FPN [86]	Scratches, rolled-in scale, inclusion, and other similar defects	Key feature information focus reinforcement	Multi-scale defect recognition capability
PFP [87]	Nonconductive, rubbing, orange peel, and other similar defects	Spatial-channel-scale feature capture integration	Speed-accuracy balance optimization

Table 5. Summary of metal surface defect detection datasets.

Dataset	Data Source	Defect Type	Image Size	Number of Images	Number of Classes	Annotation Type	Imbalance Ratio	Link	Related Research
NEU-DET	Hot-rolled steel strip surface	Hot-rolled steel strip surface defects	$200 \times 200$	1800	6	Bounding box annotations	Low imbalance	http://faculty.neu.edu.cn/songkechen/zh_CN/zdylm/263270/list/ (accessed on 17 March 2026)	[93,94]
GC10-DET	Collected from real industrial production	Steel plate surface defects	$2048 \times 1000$	2294 (after data cleaning)	10	Bounding box annotations	Severe imbalance	https://github.com/lvxiaoming2019/GC10-DET-Metallic-Surface-Defect-Datasets (accessed on 17 March 2026)	[95,96]
Tianchi Aluminum Surface Defect Dataset	Leading aluminum profile enterprise in Nanhai, Foshan	Aluminum profile surface defects	$2560 \times 1920$	2776	10	Bounding box annotations	Moderate imbalance	https://tianchi.aliyun.com/competition/entrance/231682/information (accessed on 17 March 2026)	[97,98]
X-SDD	Hot-rolled steel strip surface	Hot-rolled steel strip surface defects	$128 \times 128$	1360	7	Bounding box annotations	Moderate imbalance	https://ieee-dataport.org/documents/x-sdd (accessed on 17 March 2026)	[99,100]
RSDDs	Rail surface	Rail surface defects	$160 \times 1000$ and $55 \times 1250$	195	At least one defect per image	Bounding box annotations	Low imbalance	https://ieee-dataport.org/documents/rsdds (accessed on 17 March 2026)	[101,102]

Table 6. mAP (%) ranges of different method series on multiple datasets.

Method Series	NEU-DET (mAP/%)	GC10-DET (mAP/%)	RSDDs (mAP/%)
YOLO series	76–80	63–66	75–87
R-CNN series	72–83	51–64	72–93
Transformer series	75–82	72–74	75–96
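
For readers less familiar with the mAP values reported in Table 6, the following minimal Python sketch shows one common way the metric is computed: boxes are matched by intersection over union (IoU), a per-class average precision (AP) is taken as the area under the precision-recall curve, and mAP is the mean of the per-class APs. The function names, the (x1, y1, x2, y2) box format, and the all-point interpolation are assumptions made for illustration; the surveyed papers may use other conventions (for example, a fixed IoU threshold of 0.5 or 11-point interpolation), so this is not necessarily the exact protocol behind the table.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_tp, num_gt):
    """AP for one class.

    is_tp: list of booleans, one per detection, sorted by descending
           confidence; True means the detection matched a ground-truth box.
    num_gt: number of ground-truth boxes of this class.
    Uses all-point interpolation (an assumed convention).
    """
    if num_gt == 0 or not is_tp:
        return 0.0
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make the precision envelope monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

With this sketch, a class with two ground-truth boxes and three ranked detections marked hit, miss, hit yields an AP of 5/6; multiplying the mean over classes by 100 gives a percentage comparable in form to the table entries.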

Table 7. Performance comparison of different detection models on NEU-DET.

Metric/Model	Improved DINAT	R-CNN + Transformer	Faster R-CNN	YOLOv5s	YOLOv7	Improved YOLO Model
Architecture	Transformer	R-CNN + Transformer	R-CNN	YOLO	YOLO	YOLO
Params (M)	69.5	66.1	136.79	7.16	37.22	7.1
FPS (Images/s)	29.6	26.7	20.1	64.2	28.8	70.2
mAP (%)	83.7	77.6	75.61	74.24	77.08	81.71
Optimizer	SGD	SGD	Adam	Adam	Adam	Adam
Input Resolution	$200 \times 200$	$200 \times 200$	$200 \times 200$	$200 \times 200$	$200 \times 200$	$200 \times 200$
Data Split (Train:Test:Val)	8:1:1	8:1:1	8:2:0	8:2:0	8:2:0	8:2:0
Data Augmentation	No	No	Mosaic augmentation	Mosaic augmentation	Mosaic augmentation	Mosaic augmentation
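
The speed-accuracy trade-off in Table 7 can be made explicit with a small Pareto check. The sketch below, in Python, transcribes the FPS and mAP values from the table and keeps only the models that no other model beats on both metrics at once. The dictionary layout and the helper names `dominates` and `pareto_front` are hypothetical, and because the rows differ in optimizer, data split, and augmentation, the result should be read as a rough illustration rather than a controlled ranking.

```python
# FPS and mAP values transcribed from Table 7 (NEU-DET).
MODELS = {
    "Improved DINAT": {"params_m": 69.5, "fps": 29.6, "map": 83.7},
    "R-CNN + Transformer": {"params_m": 66.1, "fps": 26.7, "map": 77.6},
    "Faster R-CNN": {"params_m": 136.79, "fps": 20.1, "map": 75.61},
    "YOLOv5s": {"params_m": 7.16, "fps": 64.2, "map": 74.24},
    "YOLOv7": {"params_m": 37.22, "fps": 28.8, "map": 77.08},
    "Improved YOLO Model": {"params_m": 7.1, "fps": 70.2, "map": 81.71},
}


def dominates(a, b):
    """True if a is no worse than b on FPS and mAP and strictly better on one."""
    no_worse = a["fps"] >= b["fps"] and a["map"] >= b["map"]
    strictly_better = a["fps"] > b["fps"] or a["map"] > b["map"]
    return no_worse and strictly_better


def pareto_front(models):
    """Names of models not dominated by any other model, sorted alphabetically."""
    front = []
    for name, metrics in models.items():
        others = (m for n, m in models.items() if n != name)
        if not any(dominates(other, metrics) for other in others):
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    print(pareto_front(MODELS))
```

On the transcribed values, only Improved DINAT (highest mAP) and the Improved YOLO Model (highest FPS) remain on the front, which matches the qualitative picture of transformer models leading on accuracy and YOLO variants leading on real-time performance.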

Table 8. Comparison of representative detection paradigms.

Evaluation Metric	YOLO Series	R-CNN Series	DETR Series
mAP Level	Medium–High	Medium–High	High
Parameter Scale	Small	Medium–Large	Medium–Large
FPS Ranking	High	Low–Medium	Medium
Sensitivity to Small Defects	Moderate; improved in recent versions	Moderate–Strong; accurate region localization	Strong; global attention for fine-grained feature capture
Deployment Cost	Low	Medium–High	Medium–High
Core Modeling Mechanism	End-to-end detection; local convolutional modeling with FPN-based multi-scale fusion	Two-stage detection; region proposal with local convolutional feature extraction	Transformer-based global attention modeling

