A Steel Defect Detection Model Enhanced by Pinwheel-Shaped Convolution and Pyramid Sparse Transformer

Gao, Shuangxi; Guo, Xinqi; Wu, Chao; Chen, Miao; Yu, Gui

doi:10.3390/sym17122085


by Shuangxi Gao, Xinqi Guo, Chao Wu, Miao Chen and Gui Yu *

School of Mechatronic and Intelligent Manufacturing, Huanggang Normal University, Huanggang 438000, China

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(12), 2085; https://doi.org/10.3390/sym17122085

Submission received: 8 November 2025 / Revised: 2 December 2025 / Accepted: 3 December 2025 / Published: 4 December 2025

(This article belongs to the Section Engineering and Materials)


Abstract

Steel surface defect detection is critical for ensuring industrial product quality and safety. Although deep learning-based detectors like the YOLO series have demonstrated considerable promise, they often struggle with three key challenges under computational constraints: the anisotropic morphology (i.e., direction-variant shapes) of defects, insufficient modeling of long-range dependencies, and the confusion between signal and noise in feature representation. To address these issues, this paper proposes PSC-YOLO, an enhanced model based on YOLOv11n. Our core design philosophy leverages symmetry principles to guide feature representation and fusion. First, we introduce Pinwheel-shaped Convolution (PConv), whose set of rotationally symmetric kernels explicitly captures multi-directional features to effectively represent anisotropic defects. Second, a Pyramid Sparse Transformer (PST) module is integrated to capture global context via its efficient cross-scale sparse attention, which reduces computational complexity by dynamically focusing on the most relevant features across different scales, leveraging a symmetrical pyramid architecture for balanced multi-scale fusion, thereby overcoming the bottleneck in long-range dependency modeling. Finally, a Channel-Prior Convolutional Attention (CPCA) mechanism is embedded to perform dynamic feature recalibration, which leverages internal structural symmetry—through symmetric pooling pathways and parallel multi-scale convolutions—to suppress background noise and highlight salient defects. Comprehensive experiments on the public NEU-DET dataset show that PSC-YOLO achieves superior performance, obtaining a mAP@0.5 of 78.3% and a mAP@0.5:0.95 of 48.3%, while maintaining a real-time inference speed of 2.8 ms per image. This demonstrates the model’s strong potential for deployment on industrial production lines, enabling high-precision, real-time quality inspection.

Keywords: steel surface defect detection; YOLO; Pinwheel-shaped Convolution; Pyramid Sparse Transformer; Channel-Prior Convolutional Attention

1. Introduction

Steel is a cornerstone material for modern infrastructure and industry, finding critical applications in construction, transportation, and manufacturing. The integrity of its surface is paramount, as defects directly undermine the structural safety, durability, and value of the final product [1,2]. In industrial production, surface imperfections such as scratches, inclusions, and pits are inevitable due to equipment wear, process fluctuations, and environmental interference [3,4]. These defects not only compromise aesthetics but, more critically, can act as stress concentrators, leading to catastrophic failures in safety-critical applications like aerospace and automotive manufacturing [5]. Therefore, developing high-precision, high-efficiency automated detection systems is essential for ensuring product quality, optimizing production processes, and minimizing potential safety hazards.

The pursuit of such automated systems is a key endeavor within the field of Structural Health Monitoring (SHM), which aims to provide comprehensive integrity assessment for engineering structures. SHM leverages diverse, complementary methodologies. For instance, techniques like Acoustic Emission (AE) tomography are highly effective for interrogating internal material integrity and tracking subsurface damage evolution [6]. In contrast, the task of surface quality inspection, which is the focus of this work, is most directly addressed through vision-based approaches. Thus, our research on automated visual surface defect detection represents a critical and complementary component of a broader SHM ecosystem, working in concert with other non-destructive evaluation techniques to ensure overall structural safety.

The evolution of this vision-based approach for surface inspection has progressed through distinct technological phases. Initially, manual visual inspection was widely employed; while straightforward, this method is inherently subjective, labor-intensive, and prone to fatigue-induced errors, rendering it unsuitable for modern high-throughput production lines [7,8]. Subsequent automation efforts leveraged traditional machine vision and classical image processing techniques. These approaches utilized hand-crafted features—such as Local Binary Patterns (LBP), histogram statistics, and edge descriptors—combined with shallow machine learning models like Support Vector Machines (SVM) for classification [9,10,11]. Although an improvement over manual checks, their performance heavily relies on expert-designed features, lacking the robustness and generalization required to handle the vast diversity of defect morphologies, complex backgrounds, and varying lighting conditions in industrial environments.

The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized the field. CNNs’ ability to automatically learn hierarchical feature representations from data has led to unprecedented performance. Early research focused on classification networks, achieving remarkable accuracy in identifying the type of defect present in an image [12,13]. However, classification alone is insufficient for industrial quality control, as it cannot pinpoint the location of defects. This limitation propelled the adoption of object detection frameworks. Single-stage detectors like the YOLO (You Only Look Once) series have become particularly popular due to their compelling trade-off between speed and accuracy, which is crucial for real-time inspection [14,15]. For even finer-grained analysis, semantic segmentation networks have been explored to achieve pixel-level defect delineation, providing precise shape and size information for quality analysis [16,17]. In related domains such as crack characterization, recent studies have demonstrated that attention-based architectures can achieve high efficiency in segmenting complex, thin cracks with minimal computational load [18].

Despite these significant advancements, a critical gap persists between academic research and industrial deployment. Many state-of-the-art models prioritize accuracy at the expense of computational complexity, resulting in large parameter counts and high inference latency. This makes them unsuitable for deployment on the resource-constrained edge computing devices typically available on the factory floor [19]. Consequently, the pursuit of lightweight and efficient architectures has become a central focus in recent years [20]. Strategies include employing lightweight backbone networks like MobileNet [19], utilizing depthwise separable convolutions [21], and designing efficient attention mechanisms to enhance feature representation without prohibitive computational overhead [22,23].

Notwithstanding these efforts, several challenges persist in developing a high-performance detector based on a modern architecture like YOLOv11 for steel defect inspection. First, standard convolutional kernels in CNNs may be suboptimal for capturing the complex, anisotropic textures characteristic of certain steel defects, such as elongated scratches or irregular inclusions. Second, CNNs inherently possess a limited receptive field, hindering their ability to model long-range contextual dependencies. This global context is vital for distinguishing subtle defects from a cluttered background and for detecting small, low-contrast anomalies that might be overlooked. While Vision Transformers (ViTs) excel at global modeling, their quadratic computational complexity often makes them prohibitive for dense prediction tasks in real-time applications. Finally, there is a need for more effective yet lightweight feature refinement mechanisms that can dynamically suppress irrelevant background noise and highlight salient defect features across different scales.

Our approach to these challenges is guided by a coherent design philosophy, which is articulated through the following key terms:

Anisotropic Defects: Refer to surface imperfections whose visual characteristics (e.g., shape, texture) exhibit strong directional variations, such as elongated scratches. Capturing these direction-variant features is a primary challenge.
Structural Symmetry: This is our overarching design principle, referring to the incorporation of balanced, symmetrical patterns in a module’s architecture. It manifests as rotational symmetry in kernels, hierarchical symmetry in attention paths, and symmetrical component pathways for stable feature refinement.
Orientation Awareness: This describes a model’s capability to perceive features along multiple directions, which is directly engineered to address anisotropic defects.

This philosophy enables complementary capabilities: orientation awareness (PConv) for directional defects, global context awareness (PST) for long-range dependencies, and dynamic feature recalibration (CPCA) for salient defect highlighting.

Driven by this philosophy, this paper introduces key enhancements to the YOLOv11 framework, aiming to simultaneously improve its feature perception, global understanding, and feature selection capabilities. Our main contributions are summarized as follows:

(1): We integrate Pinwheel-shaped Convolution (PConv) [24] for enhanced orientation awareness. Its rotationally symmetric kernel design ensures comprehensive and balanced feature extraction across all orientations, which is particularly effective for representing anisotropic defects (e.g., scratches) by explicitly capturing their multi-directional textural features.
(2): We introduce a Pyramid Sparse Transformer (PST) [25] for global context capture. The module’s symmetrical coarse-to-fine attention mechanism establishes a balanced computational paradigm for multi-scale feature integration.
(3): We embed a Channel-Prior Convolutional Attention (CPCA) [26] mechanism for precise feature recalibration. Its design incorporates structural symmetry within its components—such as symmetric pooling in channel attention and parallel multi-scale depthwise convolutions in spatial attention. This enables a synergistic and computationally efficient refinement of features, which effectively suppresses background noise and dynamically amplifies salient defect regions.

Comprehensive experiments on public datasets demonstrate that the improved model proposed in this study achieves a superior balance between detection accuracy and inference speed, offering a powerful solution for real-time and accurate steel surface defect inspection.

2. Methods

2.1. Network Architecture

This paper proposes an improved model based on YOLOv11n. The overall architecture comprises three components: a Backbone network, a Neck network, and a Detection Head, designed to achieve efficient multi-scale feature learning and fusion. This design enhances detection accuracy while maintaining real-time performance. The complete network structure is illustrated in Figure 1.

The Backbone extracts hierarchical features from the input image. It consists of a series of standard Convolutional Blocks (CBS) and C3k2 modules, which progressively downsample the input to expand the receptive field and increase the channel count. In the deeper layers, features are first processed by an SPPF module to aggregate multi-scale contextual information, followed by a C2PSA module. To strengthen the representation of critical features, we introduce a Channel-Prior Convolutional Attention (CPCA) module at the terminus of the Backbone. This module recalibrates feature responses along the channel dimension, enabling the network to focus more precisely on salient information.

The Neck constructs a feature pyramid for multi-scale fusion using an FPN+PAN structure integrated with a PST module and Pinwheel-shaped Convolution (PConv). In the top-down pathway, deep features are upsampled and concatenated with their corresponding shallow features. The concatenated features are then processed by the PST module for efficient transformation and enhancement, which effectively preserves vital spatial details. Concurrently, in the bottom-up pathway, shallow features are downsampled using PConv, concatenated with deep features, and refined by C3k2 modules to fuse semantic information. This bidirectional architecture ultimately produces a powerful feature pyramid containing P3/8, P4/16, and P5/32 output levels.
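As a quick check on the P3/8, P4/16, and P5/32 notation, the sketch below computes the spatial size of each pyramid level for a square input. It assumes the 640 × 640 training resolution reported in Section 3.3; the function name `pyramid_sizes` is illustrative and not part of the model code.

```python
def pyramid_sizes(input_size=640, strides=(8, 16, 32)):
    """Return the (height, width) of each pyramid level for a square input.

    The level names follow the P3/8, P4/16, P5/32 convention: the number
    after the slash is the total downsampling stride of that level.
    """
    sizes = {}
    for level, stride in zip(("P3", "P4", "P5"), strides):
        side = input_size // stride  # each level shrinks the input by its stride
        sizes[level] = (side, side)
    return sizes


print(pyramid_sizes())  # {'P3': (80, 80), 'P4': (40, 40), 'P5': (20, 20)}
```

The finest level (P3) therefore keeps an 80 × 80 grid, which is where small, low-contrast defects are most likely to remain resolvable.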

The core improvements of the proposed model over the baseline are summarized as follows:

(1): We introduce the CPCA module at the end of the Backbone, where it works in concert with the original C2PSA module to sharpen the network’s focus on critical features.
(2): We integrate the PST module into the upsampling path of the Neck, leveraging its superior feature reconstruction and global context modeling to optimize the fusion process and enrich spatial and semantic information.
(3): We replace standard convolutions with PConv in the downsampling path. Its rotationally symmetric kernels are specifically designed to capture multi-directional features, enabling effective representation of anisotropic defects. This structure also significantly enlarges the receptive field with minimal parameters, collectively contributing to a more discriminative and robust feature pyramid.

2.2. Pinwheel-Shaped Convolution (PConv)

Steel surface defects often manifest as anisotropic linear features (e.g., scratches) or complex-shaped spots (e.g., pits). The isotropic nature of standard convolutional kernels is insufficient for efficiently capturing such characteristics. To address this, we incorporate the Pinwheel-shaped Convolution (PConv) [24], which employs asymmetric padding to construct direction-aware kernels, thereby offering enhanced adaptability to defects with varying orientations.

PConv's design differs from that of other orientation-aware convolutions. Unlike deformable convolutions [27], which learn adaptive sampling points at higher computational cost and complexity, PConv employs a set of fixed, rotationally symmetric kernels that are hardware-efficient and stable to optimize. Unlike hand-crafted filters such as Gabor filters, PConv's parameters are learnable, enabling it to adapt to specific defect patterns. This makes PConv an efficient and effective choice for capturing anisotropic defect morphology on computationally constrained devices.

The architecture of the PConv module is depicted in Figure 2. Given an input tensor $X^{(h_1, w_1, c_1)}$, the first layer of PConv performs four parallel convolutions:

\begin{aligned}
X_1^{(h', w', c')} &= \mathrm{SiLU}(\mathrm{BN}(X_{P(1, 0, 0, 3)}^{(h_1, w_1, c_1)} \otimes W_1^{(1, 3, c')})), \\
X_2^{(h', w', c')} &= \mathrm{SiLU}(\mathrm{BN}(X_{P(0, 3, 0, 1)}^{(h_1, w_1, c_1)} \otimes W_2^{(3, 1, c')})), \\
X_3^{(h', w', c')} &= \mathrm{SiLU}(\mathrm{BN}(X_{P(0, 1, 3, 0)}^{(h_1, w_1, c_1)} \otimes W_3^{(1, 3, c')})), \\
X_4^{(h', w', c')} &= \mathrm{SiLU}(\mathrm{BN}(X_{P(3, 0, 1, 0)}^{(h_1, w_1, c_1)} \otimes W_4^{(3, 1, c')})),
\end{aligned}

(1)

Here, $\otimes$ represents the convolution operator, and $W_1^{(1, 3, c')}$ denotes a 1 × 3 convolutional kernel with c′ output channels. The padding parameter P(1,0,0,3) specifies the number of pixels padded in the left, right, top, and bottom directions, respectively. After the initial interleaved convolution operations, the output feature map's height (h′), width (w′), and channel count (c′) relate to the input feature map's dimensions as follows:

h' = \frac{h_1}{s} + 1, \quad w' = \frac{w_1}{s} + 1, \quad c' = \frac{c_2}{4},

(2)

where c₂ denotes the number of channels in the final output feature map of the pinwheel-shaped convolution module, and s represents the convolution stride. The results from the initial interleaved convolution are concatenated, with the output computed as follows:

{X'}^{(h', w', 4c')} = \mathrm{Cat}(X_1^{(h', w', c')}, \dots, X_4^{(h', w', c')}).

(3)

Finally, the concatenated tensor is normalized by a convolution with kernel $W^{(2, 2, c_2)}$ and no padding. The height and width of the output feature map are thereby adjusted to the predefined dimensions h₂ and w₂, enabling PConv to be used interchangeably with standard convolutional layers. At the same time, this layer serves as a channel attention mechanism by analyzing the contributions of the different convolutional orientations. The final output $Y^{(h_2, w_2, c_2)}$ is calculated as follows:

h_2 = h' - 1 = \frac{h_1}{s}, \quad w_2 = w' - 1 = \frac{w_1}{s},

(4)

Y^{(h_2, w_2, c_2)} = \mathrm{SiLU}(\mathrm{BN}({X'}^{(h', w', 4c')} \otimes W^{(2, 2, c_2)}))

(5)

This design ensures that PConv can be used interchangeably with standard convolutional layers. Its receptive field approximates a Gaussian distribution, enabling feature extraction from multiple orientations and evaluating their contributions, which functions similarly to a channel attention mechanism. This property makes PConv particularly suitable for enhancing the model’s capability to capture the diverse morphology of steel defects.
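The shape bookkeeping in Eqs. (2) and (4) can be traced with a few lines of arithmetic. The sketch below is only a shape calculator, not a convolution implementation: it assumes that h₁ and w₁ are divisible by the stride s and that c₂ is divisible by 4, and the function name `pconv_shapes` is hypothetical.

```python
def pconv_shapes(h1, w1, c2, s=1):
    """Trace tensor shapes through PConv as (height, width, channels).

    Assumes h1 and w1 are divisible by the stride s and c2 by 4, so that
    every division in Eqs. (2) and (4) is exact.
    """
    if h1 % s or w1 % s:
        raise ValueError("h1 and w1 must be divisible by the stride s")
    if c2 % 4:
        raise ValueError("c2 must be divisible by 4 (one quarter per branch)")

    # Eq. (2): each of the four padded branches yields one extra row and column.
    h_mid, w_mid, c_mid = h1 // s + 1, w1 // s + 1, c2 // 4

    # Eq. (4): the unpadded 2 x 2 convolution removes that extra row and column.
    h2, w2 = h_mid - 1, w_mid - 1

    return {
        "branch": (h_mid, w_mid, c_mid),      # X_1 ... X_4
        "concat": (h_mid, w_mid, 4 * c_mid),  # X' in Eq. (3)
        "output": (h2, w2, c2),               # Y in Eq. (5)
    }


print(pconv_shapes(80, 80, 128, s=2))
```

For an 80 × 80 input with s = 2 and c₂ = 128, each branch is 41 × 41 × 32 and the output is 40 × 40 × 128, the same shape a stride-2 standard convolution would give, which is what makes the two layers interchangeable.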

2.3. Pyramid Sparse Transformer (PST) Module

Steel surface defects often exhibit diverse morphologies, varying scales, and sparse distribution across the image. Standard convolutional operations struggle to model the long-range dependencies and global contextual information vital for detecting such anomalies due to their limited receptive fields. To address this challenge while maintaining computational efficiency, we incorporate the Pyramid Sparse Transformer (PST) module [25]. The PST employs a triple-path architecture that efficiently integrates multi-scale features by establishing a symmetrical information flow between fine-grained and coarse-grained feature maps.

As illustrated in Figure 3, the PST module takes two adjacent feature maps as input: a high-resolution, fine-grained feature map F1 rich in spatial detail, and a low-resolution, coarse-grained feature map F2 carrying high-level semantics. The module’s operation can be clearly understood through its three complementary processing pathways:

Path B (Coarse-grained Attention for Global Context):

This path is responsible for efficient global modeling. We first derive queries (Q) from the fine-grained feature $F_1 \in \mathbb{R}^{C \times H \times W}$, while keys (K) and values (V) are projected from the semantically rich, coarse-grained feature $F_2 \in \mathbb{R}^{C \times \frac{H}{2} \times \frac{W}{2}}$. This cross-layer setup allows high-level semantics from F2 to guide the enhancement of F1. The coarse attention is computed as $O_{coarse} = \mathrm{softmax}(Q K^{T} / \sqrt{d_k}) V$. By operating on the downsampled token space of F2, this stage reduces the computational complexity to $\frac{1}{4} O(N^2)$.

Path A (Fine-grained Sparse Attention for Local Detail):

This path recovers critical spatial details that may be lost in the coarse process. The coarse attention matrix from Path B provides a global similarity score, which is used to select the top-k most relevant tokens from the original, high-resolution feature map F1. A refined, fine-grained attention $O_{fine} = \mathrm{softmax}(Q {K_{fine}^{sel}}^{T} / \sqrt{d_k}) V_{fine}^{sel}$ is then computed exclusively on this sparse subset. This design confines the complexity of this detail-retrieving stage to O(4Nk), making fine-grained attention on high-resolution features computationally tractable.
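The token-selection step can be illustrated with plain index arithmetic. The sketch below is one plausible reading, not the reference implementation: it assumes a single relevance score per coarse position has already been derived from the coarse attention matrix, and that each selected coarse position maps to the 2 × 2 block of fine tokens it covers (an assumption that is at least consistent with the O(4Nk) cost). The function name `select_fine_tokens` is hypothetical.

```python
def select_fine_tokens(coarse_scores, coarse_h, coarse_w, k):
    """Pick the k highest-scoring coarse positions and list their fine tokens.

    coarse_scores: flat, row-major list with one score per coarse position
    (assumed to be derived from the coarse attention matrix).
    Returns (top_coarse_indices, fine_indices), where fine indices are
    row-major positions on the (2 * coarse_h) x (2 * coarse_w) fine grid.
    """
    if len(coarse_scores) != coarse_h * coarse_w:
        raise ValueError("expected one score per coarse position")

    # Top-k coarse positions, highest score first.
    order = sorted(range(len(coarse_scores)),
                   key=lambda i: coarse_scores[i], reverse=True)
    top = order[:k]

    fine_w = 2 * coarse_w
    fine = []
    for idx in top:
        row, col = divmod(idx, coarse_w)
        # Assumed mapping: one coarse position covers a 2 x 2 fine block.
        for d_row in (0, 1):
            for d_col in (0, 1):
                fine.append((2 * row + d_row) * fine_w + (2 * col + d_col))
    return top, fine


top, fine = select_fine_tokens([0.10, 0.70, 0.15, 0.05], coarse_h=2, coarse_w=2, k=1)
print(top, fine)  # [1] [2, 3, 6, 7]
```

Under this reading, the fine-grained attention sees 4k keys and values per query regardless of image size, which is what keeps this stage linear in N for a fixed k.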

Path C (Identity Projection for Feature Stability):

In parallel, this path preserves and enhances the original features. The value features (V) from the coarse-grained input F2 first undergo a Position Encoding Convolution (PEC) to incorporate spatial structure information. This is followed by a linear projection. This path acts as a stable residual pathway, ensuring that critical foundational features are not lost or distorted during the complex attention computations, thereby improving gradient flow and training stability.

The final output O of the PST module is obtained by fusing the outputs from all three pathways: the globally-aware $O_{coarse}$ (Path B), the detail-rich $O_{fine}$ (Path A), and the stable identity projection (Path C), further enhanced by the addition of a PEC module. The overall complexity is thus reduced from the standard $O(N^2)$ to $\frac{1}{4} O(N^2) + O(4Nk)$: the coarse term costs one quarter of full attention, and the fine-grained term grows only linearly with the token count N for a fixed k.
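To get a feel for these terms, the sketch below counts query-key pairs for full attention and for the two PST attention stages. It is a rough operation count under stated assumptions, not a FLOP measurement: constants, projections, and the channel dimension are ignored, H and W are assumed even, and the values H = W = 40 and k = 8 are illustrative choices rather than settings taken from the paper.

```python
def attention_costs(height, width, k):
    """Count query-key pairs for full attention versus the PST stages.

    N = height * width fine tokens attend to N / 4 coarse tokens (Path B)
    and to 4 * k selected fine tokens (Path A).
    """
    n = height * width
    full = n * n              # standard attention: N queries x N keys
    coarse = n * (n // 4)     # Path B: N queries x N/4 keys
    fine = 4 * n * k          # Path A: N queries x 4k selected keys
    total = coarse + fine
    return {
        "tokens": n,
        "full": full,
        "coarse": coarse,
        "fine": fine,
        "pst_total": total,
        "ratio": total / full,
    }


costs = attention_costs(40, 40, k=8)
print(costs["pst_total"], costs["ratio"])  # 691200 0.27
```

In this example the two stages together need about 27% of the pairs that full attention would, and almost all of that remainder comes from the coarse term.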

This innovative triple-path design creates a powerful and efficient analysis pipeline: Path B establishes a symmetry of global context, Path A ensures a symmetry of local detail, and Path C maintains feature stability. This makes the PST module exceptionally suitable for characterizing the diverse and sparse defects found in steel surface imagery.

It is noteworthy that the entire PST module is designed with hardware friendliness in mind. Inspired by EfficientFormer, it uniformly implements all linear and layer normalization layers using a “1 × 1 Convolution + Batch Normalization” combination. This unified design, relying solely on convolutions and attention, enables better adaptation to modern hardware accelerators.

To integrate the PST module into our detection system, we adopt the PST-DET architecture depicted in Figure 4. In this architecture, PST modules serve as the core components for enhancing the Feature Pyramid Network (FPN), replacing the original convolutional layers to process features from the P3, P4, and P5 pyramid levels. The PST modules adapt to varying input resolutions through necessary upsampling and downsampling operations. This multi-scale integration establishes a symmetrical processing flow across all feature levels, ensuring consistent global context modeling while preserving the hierarchical symmetry of the feature pyramid. The focus of our work lies in applying this advanced PST-DET architecture to the steel defect detection task and systematically evaluating its effectiveness in enhancing multi-scale defect feature fusion via cross-layer sparse attention.

2.4. Channel Prior Convolutional Attention (CPCA) Mechanism

While advanced modules like PST enhance long-range dependency modeling, effectively highlighting critical local features in complex scenes remains challenging. To address this, we introduce the Channel Prior Convolutional Attention (CPCA) mechanism [26]. CPCA is designed to achieve a dynamic, efficient, and channel-aware distribution of attention. The CPCA mechanism incorporates structural symmetry within its internal components and a sequential yet synergistic processing flow across channel and spatial dimensions. Its key innovation is the “channel prior” concept, which respects and maintains the independence of feature maps across different channels during spatial attention computation, thereby avoiding enforced uniform weighting. This enables the model to adaptively focus on spatially critical regions based on each channel’s unique characteristics, achieving precise capture and enhancement of diverse local defects, building upon the global context provided by PST.

The CPCA module consists of two sequentially executed components: Channel Attention (CA) and Spatial Attention (SA), with its core architecture illustrated in Figure 5.

(1): Channel Attention (CA) Module

This module recalibrates the importance of each channel by aggregating global information. The channel attention module utilizes symmetric pooling pathways: average pooling extracts global contextual features, while max pooling highlights salient local features. The input feature map $F \in \mathbb{R}^{C \times H \times W}$ undergoes simultaneous global average pooling and global max pooling to capture complementary contextual information. Each pooled descriptor is processed by a shared Multi-Layer Perceptron (MLP), and the two outputs are summed. This symmetric aggregation maintains feature equilibrium. A sigmoid activation function subsequently generates the one-dimensional channel attention weights $M_c$. This process is formulated as:

M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))

(6)

where $\sigma$ denotes the sigmoid function, and $M_c(F) \in [0, 1]^{C}$. These weights are then used to scale the input features channel-wise, yielding the channel-refined feature $F_c$:

F_c = M_c(F) \odot F

(7)

(2): Spatial Attention (SA) Module

Having recalibrated the channel-wise importance, the Spatial Attention (SA) module then enforces the channel prior principle by processing the refined features $F_c$. This module forms the innovative core of CPCA. Rather than generating a single spatial map shared by all channels, the SA module employs a set of parallel multi-scale depthwise convolutions to extract spatial features, an architecturally symmetric arrangement that ensures balanced receptive fields across scales. This design adheres to the channel prior principle, enabling the independent capture of multi-scale spatial context for each channel, which significantly enhances flexibility and reduces computational cost. Specifically, depthwise convolutions with different kernel sizes (e.g., 7, 11, 21) process $F_c$ in parallel.

Furthermore, the multi-scale depthwise convolution module incorporates structural symmetry through factorized kernels (e.g., using 1 × 7 and 7 × 1 convolutions), maintaining a balance in directional sensitivity. The outputs of the parallel branches are concatenated and fused by a 1 × 1 convolution, followed by a sigmoid activation, to produce a refined three-dimensional spatial attention map $M_S$:

M_S(F_c) = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(\mathrm{DwConv}_{k_1}(F_c), \mathrm{DwConv}_{k_2}(F_c), \mathrm{DwConv}_{k_3}(F_c))))

(8)

where $M_S(F_c) \in \mathbb{R}^{C \times H \times W}$.

The CPCA module is integrated into the network in a sequential feed-forward manner. For an input feature F, the final output Y is given by:

Y = M_{S} (F_{c}) ⊙ F_{c}

(9)

This computational flow clearly demonstrates a “channel-first, spatial-later” sequential design: the input features are first calibrated along the channel dimension by the channel attention, enhancing the more important feature channels. The calibrated features are then refined spatially by the spatial attention, focusing on key regions. This design allows CPCA to progressively optimize the feature representation in a computationally efficient manner, thereby significantly enhancing the model’s capability to perceive subtle steel surface defects against cluttered backgrounds.
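The channel-first step of Eqs. (6) and (7) can be written out in a few lines of dependency-free Python. This is a minimal sketch under stated assumptions: the shared MLP is taken to be a bias-free, two-layer network with a ReLU in between (the text does not specify its form), the weights `w1` and `w2` are toy values, and features are stored as nested lists. The spatial-attention step of Eq. (8) is not covered here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Bias-free two-layer MLP with ReLU (assumed form of the shared MLP)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature, w1, w2):
    """Apply Eqs. (6)-(7) to a feature stored as C channels of H x W lists.

    Returns (gates, refined): one gate in (0, 1) per channel, and the
    input feature scaled channel-wise by those gates.
    """
    # Global average and max pooling: one descriptor per channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]

    # Eq. (6): the same MLP processes both descriptors; outputs are summed.
    a = shared_mlp(avg, w1, w2)
    m = shared_mlp(mx, w1, w2)
    gates = [sigmoid(x + y) for x, y in zip(a, m)]

    # Eq. (7): scale every value in a channel by that channel's gate.
    refined = [[[g * v for v in row] for row in ch]
               for g, ch in zip(gates, feature)]
    return gates, refined


feature = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0: strong response
           [[0.0, 0.0], [0.0, 1.0]]]   # channel 1: weak response
w1 = [[1.0, -1.0]]        # toy weights: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]      # toy weights: 1 hidden unit -> 2 channels
gates, refined = channel_attention(feature, w1, w2)
print(gates)
```

With these toy weights the strong channel receives a gate close to 1 and the weak channel a gate close to 0, which is the recalibration behaviour the channel-attention stage is meant to provide before spatial refinement.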

3. Experiment

3.1. Dataset

To systematically evaluate the effectiveness of the proposed method for steel surface defect detection, experiments were conducted on the widely recognized benchmark dataset, NEU-DET [28]. This dataset contains a total of 1800 grayscale images, each with a resolution of 200 × 200 pixels. It encompasses six common types of steel surface defects: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Each image is annotated with precise bounding boxes, making the dataset highly suitable for the object detection task.

The image distribution across the six defect categories is balanced, with each category containing exactly 300 images. However, the number of defect instances per image varies, presenting a significant challenge for models to accurately localize multiple defects. The key statistics and detailed descriptions of the defect categories are summarized in Table 1.

To ensure a fair and robust evaluation, we followed a common practice by randomly splitting the entire dataset into training, validation, and test sets with a ratio of 8:1:1. Specifically, 1440 images were used for model training, 180 for hyperparameter tuning and model selection, and the remaining 180 for the final performance evaluation.
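A random 8:1:1 split of this kind can be sketched with the standard library alone. The seed, the use of integer image IDs, and the absence of per-class stratification are assumptions made for illustration; the paper does not state how the random split was seeded or whether it was stratified.

```python
import random


def split_dataset(n_images=1800, ratios=(8, 1, 1), seed=0):
    """Shuffle image IDs and cut them into train/validation/test lists."""
    ids = list(range(n_images))
    random.Random(seed).shuffle(ids)  # local RNG keeps the split reproducible

    total = sum(ratios)
    n_train = n_images * ratios[0] // total
    n_val = n_images * ratios[1] // total

    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset()
print(len(train), len(val), len(test))  # 1440 180 180
```

Fixing the seed matters on a dataset this small, because a different 180-image test set can shift mAP noticeably between runs.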

3.2. Evaluation Metrics

In alignment with the evaluation protocols of the YOLO family, we adopt multiple metrics to comprehensively assess detection performance. These include Precision (Pr), Recall (Re), mAP@0.5 (mean Average Precision at an IoU threshold of 0.5), and mAP@0.5:0.95 (mean Average Precision averaged over all IoU thresholds from 0.5 to 0.95 with an interval of 0.05). Of these, mAP@0.5:0.95 is the most critical metric for evaluating detection accuracy: because it averages over progressively stricter IoU thresholds, it provides a more comprehensive and robust measure that also rewards precise localization.
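The role of the IoU thresholds can be made concrete with a small example. The sketch below computes IoU for axis-aligned boxes given as (x1, y1, x2, y2) and lists which of the ten thresholds from 0.5 to 0.95 a single prediction would satisfy. The box coordinates are made up for illustration, and a full mAP computation (confidence ranking, precision-recall integration, per-class averaging) is deliberately left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds averaged by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Thresholds at which this prediction counts as a true positive."""
    overlap = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


truth = (0, 0, 100, 100)
pred = (10, 0, 110, 100)   # same size, shifted 10 pixels to the right
print(round(iou(pred, truth), 3), thresholds_passed(pred, truth))
```

A box shifted by only 10% of its width already fails the three strictest thresholds, which is why mAP@0.5:0.95 is typically much lower than mAP@0.5 for the same detector.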

Additionally, we report the number of model parameters and GFLOPs to quantify model complexity while measuring end-to-end inference latency per image (ImgLatency) to assess detection speed. Importantly, ImgLatency incorporates three stages: pre-processing, inference, and post-processing, ensuring a complete and accurate evaluation of real-world performance.
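The three-stage definition of ImgLatency can be sketched as a simple timing wrapper. The stage functions passed in below are stand-ins, not the actual pipeline, and the helper names `timed` and `img_latency` are hypothetical. On a GPU, kernels are launched asynchronously, so a real benchmark would also need to synchronize the device before reading the clock; that step is omitted here.

```python
import time


def timed(fn, arg):
    """Run fn(arg) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, (time.perf_counter() - start) * 1000.0


def img_latency(image, preprocess, infer, postprocess):
    """Time the three stages separately and report their sum as the total."""
    x, t_pre = timed(preprocess, image)
    y, t_inf = timed(infer, x)
    result, t_post = timed(postprocess, y)
    stages = {"pre": t_pre, "inference": t_inf, "post": t_post}
    stages["total"] = t_pre + t_inf + t_post
    return result, stages


# Stand-in stages, used only to show the call pattern.
result, stages = img_latency(
    [1, 2, 3],
    preprocess=lambda img: [v * 2 for v in img],
    infer=sum,
    postprocess=lambda score: {"score": score},
)
print(result, sorted(stages))
```

Reporting the stages separately as well as their sum makes it easier to see whether a latency difference between two models comes from the network itself or from pre- and post-processing.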

3.3. Experimental Environment and Parameter Settings

All experiments, including training, validation, and most importantly, the inference speed benchmarking, were conducted on a workstation with the following hardware configuration: an NVIDIA GeForce RTX 3090 GPU, an Intel i9-10900X CPU, and 32 GB of system memory. The software environment consisted of Windows 10 operating system, utilizing Python 3.10 alongside PyTorch version 1.13.1 configured with CUDA 11.7.

Training was conducted with a constant input image size of 640 × 640 pixels. The model was trained for 300 epochs under the following configuration: an AdamW optimizer was employed, the batch size was set to 16, and the initial learning rate was 0.001, with a momentum value of 0.9 and a weight decay of 0.0005. Training incorporated a warmup period during the first three epochs, followed by cosine annealing of the learning rate to facilitate a smoother progression in gradient descent. Standard YOLO data augmentation strategies were applied, including Mosaic, random horizontal flipping, translation, scaling, and color space adjustments (hue, saturation, value). RandAugment and random erasing were also employed to enhance robustness. This consistent configuration across all compared models (both our PSC-YOLO and the baselines) ensures that the reported performance gains are attributable to the architectural innovations rather than to differences in hyperparameter tuning.
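The warmup-then-cosine schedule can be sketched as a function of the epoch index. Only the initial learning rate (0.001), the 300 epochs, and the three warmup epochs come from the text; the linear warmup shape, the per-epoch (rather than per-iteration) update, and the final learning-rate fraction of 0.01 are assumptions chosen for illustration.

```python
import math


def learning_rate(epoch, lr0=0.001, epochs=300, warmup_epochs=3, final_fraction=0.01):
    """Learning rate for a zero-based epoch index.

    Linear warmup up to lr0 over the first warmup_epochs (assumed shape),
    then cosine annealing down to lr0 * final_fraction (assumed floor).
    """
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs

    # Progress through the annealing phase, from 0.0 to 1.0.
    progress = (epoch - warmup_epochs) / max(1, epochs - 1 - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr0 * (final_fraction + (1.0 - final_fraction) * cosine)


for epoch in (0, 2, 3, 150, 299):
    print(epoch, f"{learning_rate(epoch):.6f}")
```

The warmup keeps early updates small while optimizer statistics are still unreliable, and the cosine tail lowers the step size gradually instead of in abrupt drops.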

3.4. Experimental Results and Discussion

To comprehensively evaluate the effectiveness of our proposed PSC-YOLO model, we conducted a comparative analysis against current mainstream lightweight YOLO models. To mitigate the effects of randomness inherent in small-scale datasets, all experiments were independently repeated three times; the key performance metrics are therefore reported as the mean ± standard deviation. All experiments used the same steel defect dataset and experimental environment to ensure fairness. The detailed performance metrics are compared in Table 2.

A comprehensive analysis of the data in Table 2 leads to the conclusion that our proposed PSC-YOLO model achieves the best overall performance in terms of the balance between detection accuracy and efficiency.

Among the lightweight YOLO models compared, PSC-YOLO achieved the best results on all core accuracy metrics. Specifically, our model reached 0.783 in mAP@0.5, an increase of 0.031 (a relative improvement of 4.1%) over the baseline model YOLOv11n (0.752). On the mAP@0.5:0.95 metric, PSC-YOLO achieved 0.483, an increase of 0.025 (a relative improvement of 5.5%) over YOLOv11n’s 0.458. The recall (Re) of PSC-YOLO reached 0.761, an increase of 0.041 (a relative improvement of 5.7%) over YOLOv11n’s 0.720. This gain in recall indicates that our model identifies defect targets more completely and reduces missed detections, which is crucial for steel defect detection where safety requirements are high.
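The absolute and relative improvements quoted above follow directly from the mean values in Table 2. A short sketch of the arithmetic (the helper name `relative_gain` is ours, introduced only for illustration):

```python
def relative_gain(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# (PSC-YOLO mean, YOLOv11n mean) as quoted in the text.
metrics = {
    "mAP@0.5": (0.783, 0.752),
    "mAP@0.5:0.95": (0.483, 0.458),
    "Recall": (0.761, 0.720),
}

if __name__ == "__main__":
    for name, (ours, base) in metrics.items():
        print(f"{name}: +{ours - base:.3f} absolute, "
              f"+{relative_gain(ours, base):.1f}% relative")
```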

To place these results in a broader context, we extended our comparison to include the Transformer-based RT-DETR R18 [29] detector. The results reveal a distinct performance-efficiency trade-off. RT-DETR achieves a slightly higher mAP@0.5 (79.9% versus 78.3%) and a marginally higher mAP@0.5:0.95 (48.7% versus 48.3%), but it incurs a substantial computational cost, with 21.86M parameters and an inference latency of 23.59 ms. In contrast, PSC-YOLO delivers competitive accuracy with only 2.92M parameters and a latency of 2.8 ms, making it over eight times faster. This comparison underscores the efficiency advantage of PSC-YOLO for resource-constrained deployment.
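The efficiency gap can be made concrete with the Table 2 figures. The sketch below computes the latency and parameter ratios; the throughput figure assumes that the reported latency is a per-image, batch-size-1 time with no additional pre- or post-processing, which the text does not specify.

```python
# Values transcribed from Table 2.
rt_detr = {"params_m": 21.86, "latency_ms": 23.59}
psc_yolo = {"params_m": 2.92, "latency_ms": 2.8}

# How many times faster and smaller PSC-YOLO is than RT-DETR R18.
speedup = rt_detr["latency_ms"] / psc_yolo["latency_ms"]
size_ratio = rt_detr["params_m"] / psc_yolo["params_m"]

# Assumption: latency is per image at batch size 1, so throughput = 1000 / ms.
images_per_second = 1000.0 / psc_yolo["latency_ms"]

if __name__ == "__main__":
    print(f"latency ratio:   {speedup:.1f}x")
    print(f"parameter ratio: {size_ratio:.1f}x")
    print(f"approx. throughput: {images_per_second:.0f} images/s")
```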

This improvement in accuracy is attributed to the enhanced feature representation capabilities introduced by the Pinwheel-shaped Convolution (PConv), Pyramid Sparse Transformer (PST), and Channel Prior Convolutional Attention (CPCA) modules. These modules are complementary: they capture features of anisotropic defects, model global context, and perform dynamic feature calibration, respectively. The specific contribution of each module is analyzed quantitatively in the ablation studies of the next section.

3.5. Ablation Studies

To systematically validate the effectiveness of the proposed modules and investigate their optimal integration, we conducted rigorous ablation studies based on the YOLOv11n baseline using the same steel defect dataset. Our analysis follows the principle of providing a deeper understanding of model behavior through detailed component evaluation, as advocated in rigorous defect characterization studies [30].

To this end, our ablation studies are structured around the following four dimensions: (1) the individual and synergistic contributions of the modules; (2) the impact of the Pinwheel-shaped Convolution (PConv) placement on model performance; (3) the impact of the CPCA attention mechanism’s hierarchical placement on model performance; and (4) a performance comparison of different attention mechanisms.

3.5.1. Effectiveness of Individual Modules

To quantitatively assess the contributions of the three core modules—Pinwheel-shaped Convolution (PConv), Pyramid Sparse Transformer (PST), and Channel Prior Convolutional Attention (CPCA)—we conducted systematic ablation experiments using YOLOv11n as the baseline. All reported results are averages from three independent runs (Table 3).

Analysis of Individual Module Contributions

The experiment first evaluated the independent efficacy of each module.

Contribution of the PST Module: Introducing the PST alone led to a relative improvement of 2.2% in the stringent mAP@0.5:0.95 and a 0.5% improvement in mAP@0.5. This gain, particularly in localization accuracy, supports PST’s efficacy as an efficient feature fusion module designed to preserve spatial detail. The enhancement is primarily attributed to its coarse-to-fine token selection mechanism. In steel defect detection, where accurately localizing low-contrast defects against a complex background is critical, the cross-layer coarse attention efficiently identifies globally salient regions, which are then refined by the sparse fine attention, thereby resolving ambiguities and enhancing boundary perception. In terms of efficiency, PST reduced GFLOPs from 6.3 to 6.1, while the inference time increased by a modest 0.5 ms (from 2.3 ms to 2.8 ms). The results indicate that the PST module improves model performance, with a notable improvement in overall localization precision, at a manageable computational overhead.
Contribution of the PConv Module: Introducing PConv alone resulted in stable improvements in both mAP@0.5 and mAP@0.5:0.95, by 0.4% and 1.7%, respectively. This suggests that its pinwheel-shaped structure captures the features of anisotropic defects (such as scratches in different directions) and better matches the morphological characteristics of such targets. While delivering these gains, PConv kept computational complexity (6.3 GFLOPs) and inference time (2.3 ms) identical to the baseline model, highlighting its computational efficiency.

Synergistic Effects Between Modules

Upon validating the independent effectiveness of the modules, we further investigated their combinatorial effects.

Synergy between PST and PConv: Combining PST and PConv led to a comprehensive synergistic performance improvement. The model achieved 0.476 in mAP@0.5:0.95 (a 3.9% increase over the baseline) and 0.760 in mAP@0.5 (a 1.1% increase). This indicates that within the neck network, the global semantic guidance provided by PST in the upsampling path and the efficient receptive field expansion coupled with anisotropic feature capture achieved by PConv in the downsampling path are highly complementary, jointly building a more powerful multi-scale feature fusion pipeline.
Final Optimizing Effect of CPCA: Building upon the enhanced neck network, the CPCA attention module was further integrated at the end of the backbone network to form the complete PSC-YOLO. CPCA dynamically allocates attention weights across both channel and spatial dimensions and performs adaptive calibration on the backbone features, highlighting critical information and suppressing background interference. This provides a higher-quality information source for subsequent fusion and yields the largest single-step gain in recall and mAP@0.5 in the ablation: Recall (Re) rose from 0.725 to 0.761, while mAP@0.5 and mAP@0.5:0.95 increased by 3.0% and 1.5% in relative terms, respectively. CPCA also reduced GFLOPs from 6.0 to 5.7 and kept the inference time at 2.8 ms, although the parameter count rose from 2.38M to 2.92M.

Table 3. Impact of Improved Module on Model Performance.

Methods	Precision	Recall	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs	Latency (ms/image)
YOLOv11n	0.758	0.720	0.752 ± 0.010	0.458 ± 0.005	2.46	6.3	2.3
YOLOv11n + PST	0.748	0.722	0.756 ± 0.006	0.468 ± 0.011	2.45	6.1	2.8
YOLOv11n + PConv	0.751	0.716	0.755 ± 0.002	0.466 ± 0.004	2.40	6.3	2.3
YOLOv11n + PST + PConv	0.734	0.725	0.760 ± 0.005	0.476 ± 0.004	2.38	6.0	2.8
YOLOv11n + PST + PConv + CPCA (ours)	0.763	0.761	0.783 ± 0.009	0.483 ± 0.013	2.92	5.7	2.8

3.5.2. Performance Analysis of PConv Placement Strategies

To investigate the optimal placement of the Pinwheel-shaped Convolution (PConv), we compared three configurations, as shown in Table 4.

The results indicate that placing PConv solely in the neck network yields the best overall balance: it achieves the highest detection accuracy (mAP@0.5: 0.783; mAP@0.5:0.95: 0.483) and recall (Re: 0.761), together with the lowest computational cost (5.7 GFLOPs) and latency (2.8 ms). Placing it in the backbone or across the whole network offers slight advantages in individual metrics (e.g., a Precision of 0.765 in the Backbone configuration, or 2.87M parameters versus 2.92M in the ALL configuration), but at the cost of lower recall, higher latency (3.4 ms), and more GFLOPs. This suggests that concentrating PConv in the neck network, which is responsible for feature integration, is the key design choice for balancing detection performance and inference cost.

Table 4. Impact of PConv Placement on Model Performance.

Methods	Precision	Recall	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs	Latency (ms/image)
Neck	0.763	0.761	0.783 ± 0.009	0.483 ± 0.013	2.92	5.7	2.8
Backbone	0.765	0.721	0.767 ± 0.004	0.472 ± 0.002	2.94	5.8	3.4
ALL	0.755	0.737	0.770 ± 0.018	0.468 ± 0.009	2.87	5.9	3.4

3.5.3. Performance Analysis of CPCA Placement Strategies

We further investigated the impact of the hierarchical placement of the Channel Prior Convolutional Attention (CPCA) module on performance (Table 5).

The experiments show that placing CPCA at the end of the backbone network (after C2PSA) achieves both the highest recall (Re = 0.761) and the highest localization accuracy (mAP@0.5:0.95 = 0.483). This placement allows it to perform a global calibration and ‘purification’ of the semantically rich features before they enter the complex neck network, highlighting the information most relevant to the task. When CPCA is moved to the neck network, the parameter count decreases (2.39M and 2.55M versus 2.92M), but GFLOPs change only marginally (6.2 for P3 and 5.6 for P3 + P4 + P5, versus 5.7), latency increases (3.1 ms and 3.3 ms versus 2.8 ms), and both recall and mAP@0.5:0.95 drop (to 0.735/0.461 and 0.745/0.471, respectively). This indicates that performing attention calibration after the feature pyramid is constructed is less effective. Therefore, positioning CPCA at the end of the backbone network is the best of the tested options for its role as a feature preprocessing module.

Table 5. Impact of CPCA Placement on Model Performance.

Methods	Precision	Recall	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs	Latency (ms/image)
Backbone (after C2PSA)	0.763	0.761	0.783 ± 0.009	0.483 ± 0.013	2.92	5.7	2.8
Neck (P3)	0.763	0.735	0.771 ± 0.003	0.461 ± 0.009	2.39	6.2	3.1
Neck (P3 + P4 + P5)	0.765	0.745	0.780 ± 0.002	0.471 ± 0.007	2.55	5.6	3.3

3.5.4. Comparative Analysis of Different Attention Mechanisms

To validate the superiority of the CPCA attention mechanism, we compared it against several mainstream attention modules under identical settings; the results are shown in Table 6.

Table 6. Impact of Different Attention Mechanisms on Model Performance.

Methods	Precision	Recall	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs	Latency (ms/image)
CPCA [26]	0.763	0.761	0.783 ± 0.009	0.483 ± 0.013	2.92	5.7	2.8
EMA [31]	0.761	0.725	0.766 ± 0.009	0.467 ± 0.003	2.81	5.6	2.8
CAA [32]	0.766	0.745	0.777 ± 0.008	0.476 ± 0.009	2.93	5.6	2.7
CBAM [33]	0.765	0.753	0.774 ± 0.006	0.466 ± 0.009	2.86	5.5	2.7
ELA [34]	0.753	0.735	0.768 ± 0.013	0.465 ± 0.006	3.24	5.6	2.7
SE [35]	0.751	0.725	0.763 ± 0.007	0.468 ± 0.003	2.81	5.5	2.8
LSKA [36]	0.754	0.729	0.763 ± 0.011	0.468 ± 0.006	2.87	5.6	2.8

The experiments show that the adopted CPCA module delivers the best overall performance. It ranks first in recall (Re = 0.761), mAP@0.5 (0.783), and mAP@0.5:0.95 (0.483), which we attribute to its dynamic weight allocation across both channel and spatial dimensions. CAA achieves the second-best accuracy (mAP@0.5 = 0.777, mAP@0.5:0.95 = 0.476) and the highest precision (0.766), but its recall (0.745) is lower than that of CPCA and CBAM. CBAM offers the second-best recall (0.753), but its localization accuracy (mAP@0.5:0.95 = 0.466) is weaker, which may relate to its design, in which the spatial attention module generates a single attention map shared by all channels. The remaining attention mechanisms trail CPCA on all three of these metrics. These results support the effectiveness of CPCA for the steel defect detection task; its feature calibration capability is an important component of the final model’s performance.
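The rankings discussed above can be reproduced from the mean values in Table 6. The sketch below sorts the attention modules by each metric; it ranks on means only and ignores the reported standard deviations, so close neighbors in the ordering should not be over-interpreted.

```python
# Rows transcribed from Table 6:
# (precision, recall, mAP@0.5 mean, mAP@0.5:0.95 mean).
TABLE6 = {
    "CPCA": (0.763, 0.761, 0.783, 0.483),
    "EMA": (0.761, 0.725, 0.766, 0.467),
    "CAA": (0.766, 0.745, 0.777, 0.476),
    "CBAM": (0.765, 0.753, 0.774, 0.466),
    "ELA": (0.753, 0.735, 0.768, 0.465),
    "SE": (0.751, 0.725, 0.763, 0.468),
    "LSKA": (0.754, 0.729, 0.763, 0.468),
}
COLUMNS = ("precision", "recall", "mAP@0.5", "mAP@0.5:0.95")


def ranking(column):
    """Module names sorted from best to worst on the given column."""
    idx = COLUMNS.index(column)
    return sorted(TABLE6, key=lambda name: TABLE6[name][idx], reverse=True)


if __name__ == "__main__":
    for col in COLUMNS:
        print(f"{col}: {ranking(col)[:3]}")
```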

4. Conclusions

This paper has presented PSC-YOLO, a novel and efficient steel defect detection model designed to address the specific challenges of industrial inspection. Our core design philosophy leverages principles of symmetry to guide feature representation and fusion. By strategically enhancing the YOLOv11n architecture with three key components, the model achieves a superior balance between accuracy and speed.

The integration of Pinwheel-shaped Convolution (PConv) empowers the network with enhanced orientation awareness through its rotationally symmetric kernel design, allowing it to better represent the anisotropic characteristics of defects like scratches and inclusions. The incorporation of the Pyramid Sparse Transformer (PST) module effectively captures long-range, cross-scale global context through its symmetrical coarse-to-fine attention mechanism, which is crucial for identifying small and low-contrast defects amidst cluttered backgrounds, all while maintaining manageable computational costs through its sparse attention mechanism. Furthermore, the Channel-Prior Convolutional Attention (CPCA) mechanism enables precise, dynamic recalibration of features via its structurally symmetric components and synergistic channel-spatial processing flow, thereby focusing the network’s attention on critical defect regions.

Comprehensive experiments and ablation studies on the NEU-DET dataset support the effectiveness of our approach. PSC-YOLO outperforms the compared lightweight YOLO models in detection accuracy (mAP and Recall) while remaining competitive in efficiency: its 5.7 GFLOPs is lower than that of every compared YOLO model except YOLOv5n (4.2), and its 2.8 ms latency is within 1 ms of the fastest model (YOLOv10n, 1.8 ms). The success of our model underscores the efficacy of incorporating symmetric design principles, such as rotational symmetry in PConv, hierarchical symmetry in PST, and internal structural symmetry in CPCA, to handle the inherent asymmetries and variations in real-world steel defects. The studies also provided insights into the placement of these modules, indicating that positioning PConv in the neck and CPCA at the end of the backbone yields the best performance among the tested configurations.

Notwithstanding these promising results, we acknowledge the limitations of our current evaluation, which is primarily based on the NEU-DET benchmark. While this dataset is widely adopted, its relatively small scale and limited diversity in image resolution and defect manifestations may constrain the model’s generalizability to more complex industrial environments. The performance of PSC-YOLO on entirely new defect types or under significantly different imaging conditions remains an area for future validation. Consequently, our future work will focus on two main directions: first, to rigorously assess the model’s robustness on larger, more diverse, and higher-resolution industrial datasets (e.g., the Severstal dataset); and second, to explore techniques such as domain adaptation to enhance its generalization capability across different production settings. Additionally, the robustness of the proposed model against common industrial perturbations (such as image noise, varying lighting conditions, and partial surface contamination) has not been systematically evaluated in this study. Investigating the model’s resilience to these challenging but realistic scenarios will constitute another critical avenue for our future work.

Furthermore, while our reported latency (2.8 ms/image) demonstrates the model’s computational efficiency on a server-grade GPU (RTX 3090), we acknowledge that this does not fully represent its performance on resource-constrained edge hardware. The actual deployment and comprehensive evaluation on devices such as Jetson Nano or Raspberry Pi, including metrics like power consumption and memory footprint, remain a critical direction for our future work to fully validate its edge suitability.

In conclusion, PSC-YOLO offers a robust and practical solution for automated steel surface defect detection, holding significant potential for deployment in real-world industrial quality control systems. The architectural insights and symmetric design principles presented herein are also expected to inspire future research in related fields of industrial visual inspection.

Author Contributions

Conceptualization, S.G. and G.Y.; methodology, G.Y.; software, X.G. and C.W.; validation, X.G. and C.W.; formal analysis, S.G. and M.C.; investigation, S.G. and M.C.; resources, S.G. and G.Y.; data curation, M.C. and X.G.; writing—original draft preparation, S.G. and G.Y.; writing—review and editing, S.G. and G.Y.; visualization, C.W. and X.G.; supervision, G.Y.; project administration, G.Y.; funding acquisition, S.G. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hubei Provincial Key R&D Program of China, grant number 2024BAB110.

Data Availability Statement

The original data presented in the study are openly available at https://drive.google.com/open?id=1qrdZlaDi272eA79b0uCwwqPrm2Q_WI3k (accessed on 13 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yang, T.; Liu, M.; Meng, Y.N.; Zhang, F.; Liu, S.J.; Mo, C.C. Improved Steel Surface Defect Detection Algorithm Based on Yolov5s Network. Mach. Tool Hydraul. 2024, 52, 19–26.
2. Mi, C.F.; Lu, K.; Wang, W.Y.; Wang, B. Research Progress on Hot-rolled Strip Surface Defect Detection Based on Machine Vision. J. Anhui Univ. Technol. (Nat. Sci.) 2022, 39, 180–188.
3. Song, K.; Feng, H.; Cao, T.; Cui, W.; Yan, Y. MFANet: Multifeature Aggregation Network for Cross-Granularity Few-Shot Seamless Steel Tubes Surface Defect Segmentation. IEEE Trans. Ind. Informat. 2024, 20, 9725–9735.
4. Konovalenko, I.; Maruschak, P.; Brezinová, J.; Prentkovskis, O.; Brezina, J. Research of U-Net-based CNN architectures for metal surface defect detection. Machines 2022, 10, 327.
5. Yu, J.; Cheng, X.; Li, Q. Surface Defect Detection of Steel Strips Based on Anchor-Free Network with Channel Attention and Bidirectional Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5000710.
6. Das, A.K.; Leung, C.K. Fast Tomography: A greedy, heuristic, mesh size–independent methodology for local velocity reconstruction for AE waves in distance decaying environment in semi real-time. Struct. Health Monit. 2022, 21, 1555–1573.
7. Zhang, R.; Fu, M.; Chen, X. Steel surface defect detection algorithm based on YOLOv5s. Sci. Technol. Eng. 2024, 24, 9980–9988.
8. Chen, G.; Wang, H.; Chen, K.; Li, Z.; Song, Z.; Liu, Y.; Chen, W.; Knoll, A. A Survey of the Four Pillars for Small Object Detection: Multiscale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 936–953.
9. Luo, Q.; Fang, X.; Liu, L.; Yang, C.; Sun, Y. Automated Visual Defect Detection for Flat Steel Surface: A Survey. IEEE Trans. Instrum. Meas. 2020, 69, 626–644.
10. Ojala, T.; Pietikäinen, M.; Harwood, D. A Comparative Study of Texture Measures with Classification Based on Featured Distributions. Pattern Recogn. 1996, 29, 51–59.
11. Song, K.; Yan, Y. A Noise Robust Method Based on Completed Local Binary Patterns for Hot-Rolled Steel Strip Surface Defects. Appl. Surf. Sci. 2013, 285, 858–864.
12. Liu, Y.; Yuan, Y.; Balta, C.; Liu, J. A Light-Weight Deep-Learning Model with Multi-Scale Features for Steel Surface Defect Classification. Materials 2020, 13, 4629.
13. Yang, Y.; Zhao, M. CNN-based Strip Steel Surface Defect Detection. Heavy Mach. 2019, 2, 25–29.
14. Li, W.G.; Ye, X.; Zhao, Y.T.; Wang, W.B. Strip Steel Surface Defect Detection Based on Improved YOLOv3 Algorithm. Acta Electron. Sin. 2020, 48, 1284–1292.
15. Cheng, X.; Yu, J. Retinanet with Difference Channel Attention and Adaptively Spatial Feature Fusion for Steel Surface Defect Detection. IEEE Trans. Instrum. Meas. 2021, 70, 2503911.
16. Dong, H.; Song, K.; He, Y.; Xu, J.; Yan, Y.; Meng, Q. PGA-net: Pyramid Feature Fusion and Global Context Attention Network for Automated Surface Defect Detection. IEEE Trans. Ind. Informat. 2020, 16, 7448–7458.
17. Zhou, X.; Fang, H.; Liu, Z.; Zheng, B.; Sun, Y.; Zhang, J.; Yan, C. Dense Attention-Guided Cascaded Network for Salient Object Detection of Strip Steel Surface Defects. IEEE Trans. Instrum. Meas. 2022, 71, 5004914.
18. Das, A.K.; Leung, C.K.Y. A Novel Technique for High-Efficiency Characterization of Complex Cracks with Visual Artifacts. Appl. Sci. 2024, 14, 7194.
19. Liu, J.; Cui, G.; Xiao, C. A Real-Time and Efficient Surface Defect Detection Method Based on Yolov4. J. Real-Time Image Process. 2023, 20, 77.
20. Zhao, S.; Li, G.; Zhou, M.; Li, M. YOLO-CEA: A Real-Time Industrial Defect Detection Method Based on Contextual Enhancement and Attention. Clust. Comput. 2023, 27, 2329–2344.
21. Ma, Z.; Li, Y.; Huang, M.; Huang, Q.; Cheng, J.; Tang, S. Automated Real-Time Detection of Surface Defects in Manufacturing Processes of Aluminum Alloy Strip Using a Lightweight Network Architecture. J. Intell. Manuf. 2022, 34, 2431–2447.
22. Liang, Y.; Li, J.; Zhu, J.; Du, R.; Wu, X.; Chen, B. A Lightweight Network for Defect Detection in Nickel-Plated Punched Steel Strip Images. IEEE Trans. Instrum. Meas. 2023, 72, 3505515.
23. Zhang, Z.; Zhou, M.; Wan, H.; Li, M.; Li, G.; Han, D. IDD-Net: Industrial defect detection method based on Deep Learning. Eng. Appl. Artif. Intell. 2023, 123, 106390.
24. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-Shaped Convolution and Scale-Based Dynamic Loss for Infrared Small Target Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 9202–9210.
25. Hu, J.; Bai, T.; Wu, F.; Peng, Z.; Zhang, Y. Pyramid Sparse Transformer: Efficient Multi-Scale Feature Fusion with Dynamic Token Selection. arXiv 2025, arXiv:2505.12772v2.
26. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel Prior Convolutional Attention for Medical Image Segmentation. Comput. Biol. Med. 2024, 178, 108784.
27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
28. Available online: https://drive.google.com/open?id=1qrdZlaDi272eA79b0uCwwqPrm2Q_WI3k (accessed on 20 September 2024).
29. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
30. Das, A.K.; Leung, C.K.Y. A Novel Deep Learning-Based Technique for Efficient Characterization of Engineered Cementitious Composites Cracks for Durability Assessment. Struct. Concr. 2025, 26, 2107–2123.
31. Ouyang, D.; He, S.; Zhang, G.; Luo, M. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563.
32. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 17–21 June 2024; pp. 27706–27716.
33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19.
34. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123v1.
35. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
36. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN. Expert Syst. Appl. 2024, 236, 121352.

Figure 1. The Network Architecture of PSC-YOLO.

Figure 2. Schematic diagram of the pinwheel-shaped convolution (PConv).

Figure 3. Overview of the Pyramid Sparse Transformer.

Figure 4. PST-DET architecture.

Figure 5. The Channel Prior Convolutional Attention Mechanism (CPCA).

Table 1. Statistics of Defect Categories in the NEU-DET Dataset.

Defect Category	Number of Instances	Percentage
crazing	689	16.45%
inclusion	1011	24.13%
patches	881	21.03%
pitted_surface	432	10.31%
rolled-in_scale	628	14.99%
scratches	548	13.08%
Total	4189	100%

Table 2. Performance Comparison of Different Models.

Methods	Precision	Recall	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs	Latency (ms/image)
YOLOv5n	0.740	0.737	0.761 ± 0.003	0.460 ± 0.004	1.69	4.2	3.7
YOLOv7-tiny	0.746	0.671	0.722 ± 0.005	0.450 ± 0.001	5.74	13.1	2.3
YOLOv8n	0.708	0.714	0.739 ± 0.016	0.449 ± 0.010	2.87	8.1	2.2
YOLOv9t	0.742	0.734	0.756 ± 0.011	0.459 ± 0.014	2.50	10.7	3.4
YOLOv10n	0.711	0.714	0.741 ± 0.004	0.425 ± 0.001	2.57	8.2	1.8
YOLOv11n	0.758	0.720	0.752 ± 0.010	0.458 ± 0.005	2.46	6.3	2.3
RT-DETR R18	—	0.728	0.799 ± 0.008	0.487 ± 0.002	21.86	29.70	23.59
PSC-YOLO (ours)	0.763	0.761	0.783 ± 0.009	0.483 ± 0.013	2.92	5.7	2.8

Note: To evaluate model stability on the small-scale dataset, all experiments were independently run three times. In this table, mAP values are reported as mean ± standard deviation; due to column width constraints, only the mean values of Precision and Recall are listed, which are also derived from the three experimental runs. The same convention applies to Table 3, Table 4, Table 5 and Table 6.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
