1. Introduction
Steel plates, as fundamental structural materials in modern industry, are widely used across various sectors such as construction, transportation, and machinery manufacturing [1,2]. However, during actual production, various factors such as manufacturing processes, raw material quality, and production environments may lead to the formation of surface defects on steel plates, which can compromise their quality and service life and potentially pose safety risks [3,4]. Therefore, the detection and classification of surface defects on steel plates are of great significance.
Early methods for defect classification typically relied on hand-crafted image features, using classifier models to categorize surface defects on steel plates [5]. These approaches are effective in scenarios with well-defined features and ample data samples, but they remain limited in handling complex backgrounds and detecting subtle defects [6]. With the rapid advancement of deep learning technologies, convolutional neural networks (CNNs) have achieved groundbreaking results in image classification tasks and have been widely adopted in the field of steel surface defect classification [7,8]. Representative CNN models such as VGG [9], ResNet [10], and EfficientNet [11] are capable of automatically learning hierarchical and discriminative features directly from raw images, significantly improving classification accuracy. Compared with traditional methods based on manually engineered features, deep learning-based defect classification approaches demonstrate clear advantages in terms of accuracy and have become the mainstream direction and a major research focus in steel surface defect classification [12].
In recent years, ConvNeXt [13], a convolutional model that incorporates modern architectural design principles, has gained considerable attention from researchers in the field of steel surface defect detection. Owing to its efficient capability in capturing multiscale textures and boundary features, ConvNeXt demonstrates more pronounced performance advantages over earlier deep convolutional neural network models as well as recent popular Vision Transformer (ViT) architectures such as ViT [14] and Swin Transformer [15].
Although ConvNeXt has gained popularity in steel surface defect classification tasks, several challenges remain. On the one hand, ConvNeXt focuses more on aggregating global information, which may lead to the omission of critical local details such as subtle defects and fine-grained structures [13]. To address this limitation, this paper introduces an improved Symmetric Dual-dimensional Attention Module (SDAM) into the ConvNeXt backbone. Inspired by the channel and spatial attention mechanisms in the Convolutional Block Attention Module (CBAM) [16], SDAM adopts a structurally symmetric, parallel design that independently models salient features along both the channel and spatial dimensions. This symmetric, parallel architecture avoids the sequential bias inherent in cascaded attention mechanisms, thereby enhancing the integrity and discriminative capability of feature representations. As a result, the model’s ability to attend to local regions and respond to defect-prone areas in both channel and spatial dimensions is significantly improved.
On the other hand, ConvNeXt exhibits limited feature fusion capabilities when dealing with defects of varying scales, particularly under complex backgrounds or in scenarios involving large disparities in defect sizes. This may lead to insufficient expression of multi-scale features. To mitigate this limitation, this paper proposes the Transformer-Fused Feature Pyramid Network (TF-FPN), which integrates a Feature Pyramid Network (FPN) [17] into the ConvNeXt backbone to strengthen the fusion of multi-scale features. Moreover, a lightweight Transformer is introduced after the multi-scale feature fusion stage of the FPN to enhance global modeling capability across scales and improve the efficiency of contextual information propagation. This design further strengthens the model’s ability to identify defects that are sparsely distributed or structurally ambiguous. Here, “lightweight” refers to a single-layer Transformer with eight attention heads and a hidden size of 256, introducing only ~1.42 million additional parameters, significantly fewer than those of conventional multi-layer Transformer architectures.
In summary, this paper proposes a method named FAX-Net for the classification of surface defects on steel plates, which is built upon the ConvNeXt architecture and incorporates SDAM and TF-FPN to enhance classification accuracy by improving the model’s ability to capture critical local details and integrate multi-scale features. The main contributions of this paper are as follows:
- (1) FAX-Net, a novel method for the classification of surface defects on steel plates, is presented.
- (2) The SDAM is introduced, which enhances the model’s ability to focus on local defect regions.
- (3) The TF-FPN is proposed, which integrates a lightweight Transformer to improve the model’s classification performance on defects with large scale variations and complex structures.
The remainder of this paper is organized as follows. Section 2 reviews related work on defect classification methods, attention mechanisms, and feature fusion strategies. Section 3 provides a detailed description of the FAX-Net architecture, including the design of its key components: the SDAM and TF-FPN. Section 4 presents the experimental setup, including the dataset, evaluation metrics, and results. Finally, Section 5 concludes the paper.
3. Method
In this section, we provide a detailed introduction to our proposed method, FAX-Net.
Section 3.1 presents the overall architecture of the network, including the basic configuration of the backbone ConvNeXt and the integration of the proposed modules.
Section 3.2 focuses on the design and improvements of the proposed SDAM.
Section 3.3 further elaborates on the implementation of the TF-FPN module.
Section 3.4 describes the training implementation details of the model, including the loss function and training setup.
3.1. The Framework of the Proposed FAX-Net
In this study, we propose an improved convolutional neural network architecture for the task of steel surface defect image classification. The network is built upon ConvNeXt as the backbone, into which the SDAM and TF-FPN modules are integrated. This design balances the capability of capturing local fine details with global contextual modeling. Unlike conventional convolutional networks, the proposed model aims to simultaneously enhance attention to key feature regions, strengthen multi-scale feature representation, and improve long-range dependency modeling. This makes it particularly suitable for industrial surface image classification tasks characterized by complex textures and blurred boundaries. The overall architecture of the network is illustrated in Figure 1. The network consists of four principal stages (Stage 1 to Stage 4). Initially, a 4 × 4 convolution with a stride of 4 downsamples the 224 × 224 input into a 56 × 56 feature map, followed by layer normalization. Each stage comprises a Conv2D operation, GELU activation, the SDAM attention mechanism, and layer normalization, all integrated with residual connections. Downsampling is performed between successive stages. Upon completion of Stage 4, multi-scale feature maps (e.g., 56 × 56, 28 × 28) are processed by the TF-FPN module and fused across scales via element-wise operations. The fused features are then passed to a Transformer module. Finally, the output is generated through global average pooling, layer normalization, and a linear projection layer.
In the backbone network, ConvNeXt adopts a pure convolutional architecture while incorporating several design strategies inspired by the ViT, such as large convolutional kernels (7 × 7), GELU activation functions, and LayerNorm. These enhancements significantly improve the depth and representational power of feature extraction without introducing attention mechanisms. In this study, the four-stage output structure of ConvNeXt is retained, producing feature maps with channel dimensions of 96, 192, 384, and 768, respectively. These multi-level semantic features serve as the foundation for subsequent modules.
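For clarity, the following minimal sketch illustrates how such four-stage, multi-level features can be extracted. It is not the authors' code: it assumes the torchvision ConvNeXt-Tiny implementation as a stand-in for the backbone, and the node names follow torchvision's module layout (they can be verified with get_graph_node_names).

```python
import torch
from torchvision.models import convnext_tiny
from torchvision.models.feature_extraction import create_feature_extractor

# Sketch: pull the four stage outputs of a ConvNeXt-Tiny backbone.
# features.1/3/5/7 are the four block stages in torchvision's layout;
# confirm with torchvision.models.feature_extraction.get_graph_node_names(model).
model = convnext_tiny(weights=None)
backbone = create_feature_extractor(
    model,
    return_nodes={"features.1": "c1", "features.3": "c2",
                  "features.5": "c3", "features.7": "c4"},
)

x = torch.randn(1, 3, 224, 224)          # one 3-channel image at 224 x 224
feats = backbone(x)
for name, f in feats.items():
    print(name, tuple(f.shape))
# Expected channel widths 96, 192, 384, 768 at spatial sizes
# 56x56, 28x28, 14x14, 7x7 for a 224x224 input.
```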
To further guide the network to focus on more discriminative channel and spatial regions for the classification task, an SDAM is embedded after each block of the ConvNeXt backbone. This module first extracts statistical information along the channel dimension using global average pooling and max pooling to generate a channel attention map. A 1 × 1 convolution followed by a nonlinear activation function is then applied to model inter-channel dependencies. For the spatial attention, the module constructs a saliency map based on the mean and max values along the channel axis and applies a convolution layer with a large receptive field to produce the spatial attention map. The channel and spatial attention maps are then fused via element-wise multiplication and applied to the input feature map. A residual connection is used to retain the original features. To enhance the adaptability of this mechanism across different image samples, two learnable scaling parameters, α and β, are introduced to adaptively modulate the channel and spatial attention, respectively. This allows the attention mechanism to remain flexible and effective across different network stages.
To effectively fuse multi-scale features from different depths, the TF-FPN module is introduced into the model. Drawing inspiration from the FPN architecture, this module first applies 1 × 1 convolutions to the output feature maps from all four stages of ConvNeXt to unify the channel dimensions to 256. Subsequently, a top–down pyramid structure is used to hierarchically fuse features: high-level feature maps are upsampled to match the resolution of the lower-level maps and are then fused via element-wise addition with their corresponding lower-level features. Finally, a 3 × 3 convolution is applied to refine the fused features at each scale. This design effectively preserves low-level spatial details and high-level semantic information, mitigates the information loss caused by resolution discrepancies across feature maps, and provides a structurally consistent input for subsequent global modeling.
Building upon the above modules, a lightweight Transformer layer is further introduced to enhance the modeling of global contextual information. The Transformer module first rearranges the 2D feature map into a sequential format suitable for self-attention operations. After Layer Normalization, the sequence is fed into a multi-head self-attention module, where global attention weights are computed to enable cross-region feature interaction. This is followed by a feed-forward neural network (FFN) that further transforms the attended features. Residual connections and additional layer normalization are applied throughout to stabilize training and improve the model’s generalization capability. Finally, the output features are aggregated via global average pooling to generate an image-level representation, which is then passed through a fully connected layer to produce the final defect classification output.
FAX-Net adopts a stacked integration of the SDAM and TF-FPN to achieve an effective synergy between local refinement and global contextual modeling. The SDAM, embedded across all stages of the ConvNeXt backbone, employs parallel channel and spatial attention to enhance salient features, thereby improving sensitivity to fine-grained and small-scale defects. The refined features are then passed to TF-FPN, which fuses multi-scale information and leverages a lightweight Transformer to capture long-range dependencies across spatial and semantic levels. This progressive interaction—from local discrimination via the SDAM to global integration via TF-FPN—significantly enhances the network’s representational capacity and is central to the superior performance of FAX-Net.
3.2. Symmetric Dual-Dimensional Attention Module
In CNNs, attention mechanisms enhance model performance by assigning different weights to various regions of the feature map, which is particularly beneficial in visual tasks. This study proposes an improved convolutional block attention module, termed the SDAM, which integrates the strengths of the original CBAM while introducing a symmetric architecture, a learnable weighting mechanism, and residual connections to further optimize attention modeling. The aim is to enhance the model’s ability to perceive critical features. A schematic illustration of the SDAM is shown in Figure 2. The module comprises symmetric and parallel branches for channel and spatial attention, which independently model the saliency information along the channel and spatial dimensions, respectively. The channel attention branch generates attention weights through global pooling operations followed by a shared multi-layer perceptron (MLP), while the spatial attention branch derives spatial weights based on the weighted feature maps. The outputs of both branches are subsequently fused to produce an enhanced and more discriminative feature representation.
The input feature map is a 3D tensor, where H denotes the height, W the width, and C the number of channels. The conventional CBAM applies independent channel and spatial attention mechanisms to weight the feature map along the channel and spatial dimensions, respectively. However, to enhance the adaptability of the model, this study introduces an improved version of the original CBAM by incorporating two learnable weighting parameters, α and β, which dynamically adjust the contributions of the channel and spatial attention mechanisms, respectively.
The core idea of the channel attention mechanism is to weight the feature maps across different channels based on their global contextual information, thereby enabling the network to focus more on the most informative channel features. To achieve this, global pooling operations are applied to the input feature map, where both average pooling and max pooling are used to capture global average and maximum information, respectively, as shown in Equation (1).
The two pooled feature maps are then concatenated to form a new tensor, which is subsequently fed into a two-layer fully connected (FC) network (or convolutional layers) to learn the importance of each channel. The first layer reduces the dimensionality of the concatenated features, while the second layer restores the original number of channels, C. Finally, a Sigmoid activation function is applied to produce the channel attention map, as defined in Equation (2). Here, W1 and W2 are the weight matrices of the two fully connected layers, σ denotes the Sigmoid activation function, and the resulting channel attention map is used to weight each channel of the input feature map.
To enhance the adaptability of the model, we introduce a learnable scaling parameter, α, to weight the channel attention map, as defined in Equation (3). Here, α controls the importance of the channel attention, and the result is the channel-weighted feature map.
The goal of the spatial attention mechanism is to weight different spatial locations of the input feature map based on the importance of regions along the spatial dimension. To achieve this, global pooling operations are first applied to the input feature map, as shown in Equation (1), to obtain a global spatial information representation.
Next, the average-pooled and max-pooled feature maps are concatenated to form a new tensor, which is then passed through a convolutional operation to generate the spatial attention map, as shown in Equation (4). Here, the convolution kernel produces the spatial response, and σ represents the Sigmoid activation function. The resulting spatial attention map reflects the importance of each spatial location in the input feature map.
Unlike the traditional CBAM, we introduce a learnable weighting parameter, β, to adaptively adjust the spatial attention map, as defined in Equation (5). Here, β controls the importance of the spatial attention, and the result is the spatially weighted feature map.
In the improved CBAM, the channel attention map and the spatial attention map are not applied independently to the feature map. Instead, the channel attention map and the spatial attention map are combined through element-wise multiplication to generate the final attention map, as defined in Equation (6).
This mechanism maintains the structural symmetry, computational independence, and balanced integration between the channel and spatial attention branches, enabling the model to simultaneously capture semantic features along the channel dimension and positional cues along the spatial dimension without being affected by processing order. In this design, the shallow stages rely more on fine-grained texture and edge information, resulting in relatively higher α values. In contrast, the deeper stages focus increasingly on semantic structures, with β values gaining prominence. This adaptive weighting across semantic depths reflects the effective coordination between channel and spatial attention, thereby enhancing the representational discriminability of the model.
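To make the above formulation concrete, a minimal PyTorch sketch of the SDAM is given below. It follows the description in this section (parallel channel and spatial branches, learnable weights α and β, element-wise fusion, and a residual connection); the reduction ratio, the activation in the channel MLP, and the 7 × 7 spatial kernel are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class SDAM(nn.Module):
    """Sketch of the Symmetric Dual-dimensional Attention Module (Equations (1)-(6)).

    Channel and spatial attention are computed in parallel (not cascaded),
    scaled by the learnable parameters alpha and beta, fused by element-wise
    multiplication, and added back through a residual connection. The
    reduction ratio and the 7x7 spatial kernel are assumed values.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: concatenated avg/max descriptors -> two-layer MLP (1x1 convs).
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),                      # activation choice is an assumption
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # Spatial branch: channel-wise mean/max maps -> large-kernel convolution.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        # Learnable weights alpha (channel) and beta (spatial).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention map, Equations (1)-(3): shape (B, C, 1, 1)
        pooled = torch.cat([x.mean(dim=(2, 3), keepdim=True),
                            x.amax(dim=(2, 3), keepdim=True)], dim=1)
        m_c = self.alpha * torch.sigmoid(self.channel_mlp(pooled))
        # Spatial attention map, Equations (4)-(5): shape (B, 1, H, W)
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        m_s = self.beta * torch.sigmoid(self.spatial_conv(s))
        # Element-wise fusion (Equation (6)) and residual connection.
        return x + x * (m_c * m_s)
```

In FAX-Net, one such module would be inserted after each ConvNeXt block, with `channels` matching the stage width (96, 192, 384, or 768).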
3.3. Feature Fusion Module
In this paper, the integration of the FPN and Transformer forms the core of the model design. The FPN provides multi-scale feature maps, while the Transformer further enhances global contextual modeling based on these features. Through this combination, the model achieves significant performance improvements in handling multi-scale objects and capturing global dependencies.
Specifically, let the feature maps extracted from different layers of the backbone network be denoted as {Fi}, where Fi represents the feature map from the i-th layer, Hi and Wi denote its height and width, respectively, and Ci is its number of channels. The FPN applies lateral convolution operations to these feature maps, as defined in Equation (7).
Then, top–down feature propagation is performed through layer-by-layer upsampling and feature fusion, as defined in Equation (8).
Subsequently, each feature map is further smoothed using a 3 × 3 convolution, resulting in a feature pyramid {Pi} under a unified semantic scale. These feature maps capture multi-scale information ranging from high-level abstractions to low-level details.
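The FPN portion described by Equations (7) and (8) can be sketched as follows. The channel widths follow Section 3.1 (stage outputs of 96/192/384/768 channels unified to 256); everything else is a minimal illustration under those assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the FPN part of TF-FPN (Equations (7)-(8)).

    1x1 lateral convolutions unify all stage outputs to 256 channels,
    a top-down pass upsamples and adds coarser maps onto finer ones,
    and a 3x3 convolution smooths each fused map.
    """
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):               # feats: [F1, F2, F3, F4], fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]   # Equation (7)
        for i in range(len(laterals) - 2, -1, -1):               # top-down pass, Equation (8)
            up = F.interpolate(laterals[i + 1], size=laterals[i].shape[-2:],
                               mode="nearest")
            laterals[i] = laterals[i] + up
        return [s(p) for s, p in zip(self.smooth, laterals)]     # feature pyramid {P_i}
```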
The multi-scale feature maps output by the FPN are then used as input to the Transformer module for global contextual modeling. To handle these features uniformly, each scale’s feature map Pi is flattened and linearly projected into a sequence of vectors, as defined in Equation (9). Here, d denotes the embedding dimension of the Transformer module. The sequence vectors from all scales are subsequently concatenated into a single input sequence Z, as defined in Equation (10).
To preserve the spatial structure of the fused multi-scale features, the model relies on the Transformer’s inherent positional encoding mechanism when passing the concatenated sequence into the Transformer module. This mechanism introduces implicit spatial cues into the token representations, thereby enabling the Transformer to infer the relative positions of features within the original maps. Consequently, the model is better equipped to capture long-range dependencies and global contextual relationships, as illustrated in Equation (11).
The Transformer uses a multi-head self-attention mechanism, where each attention head is computed as defined in Equation (12).
Here, Q, K, and V are linear transformations of the input sequence Z, and a scaling factor of √d is applied inside the attention operation. Multi-head attention captures dependencies between different positions in parallel, thereby enhancing the model’s understanding of the global structure of the image.
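A compact sketch of this global-modeling stage is given below, assuming a single-layer encoder with eight heads and embedding dimension d = 256, consistent with the “lightweight” configuration mentioned in the Introduction; the learnable positional embedding and the FFN width are illustrative assumptions rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class PyramidTransformerHead(nn.Module):
    """Sketch of the Transformer stage of TF-FPN (Equations (9)-(12)).

    Each pyramid level is flattened into tokens, projected to dimension d,
    concatenated into one sequence, processed by a single 8-head
    self-attention layer, pooled, and classified. The positional embedding
    and FFN width are assumptions, not the paper's exact design.
    """
    def __init__(self, in_channels: int = 256, d: int = 256, num_heads: int = 8,
                 num_classes: int = 6, max_tokens: int = 4608):
        super().__init__()
        self.proj = nn.Linear(in_channels, d)                      # Equation (9)
        # max_tokens must cover the total token count (4165 for a 224x224 input
        # with pyramid strides 4, 8, 16, 32).
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, d))
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d, nhead=num_heads, dim_feedforward=4 * d,
            batch_first=True, norm_first=True)                     # Equations (11)-(12)
        self.norm = nn.LayerNorm(d)
        self.fc = nn.Linear(d, num_classes)

    def forward(self, pyramid):                                    # list of (B, C, H_i, W_i)
        tokens = [p.flatten(2).transpose(1, 2) for p in pyramid]   # (B, H_i*W_i, C)
        z = torch.cat([self.proj(t) for t in tokens], dim=1)       # Equation (10)
        z = z + self.pos[:, : z.size(1)]                           # positional cues
        z = self.encoder(z)                                        # self-attention + FFN
        z = self.norm(z.mean(dim=1))                               # global average pooling over tokens
        return self.fc(z)                                          # defect class logits
```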
The multi-scale feature maps output by the FPN provide the Transformer with richer input information, containing semantic features from different hierarchical levels. Under the self-attention mechanism of the Transformer layers, these feature maps can better capture dependencies between different spatial positions within the image, especially excelling in global semantic consistency and spatial context modeling. In this way, the Transformer not only enhances the global modeling capability of the features but also further improves the semantic representation effectiveness of the features extracted by the FPN.
This integration enables the model to extract rich information from multiple scales and to perform unified modeling of this information through the self-attention mechanism, thereby exhibiting stronger discriminative capability in complex scenes. The FPN enhances the model’s adaptability to multi-scale objects, while the Transformer improves its ability to model complex patterns and long-range dependencies through global information interaction.
3.4. Training Implementation
The loss function employed in our approach is the cross-entropy loss, as defined in Equation (13).
In the definition of the cross-entropy loss function, y represents the discrete probability distribution of the true labels, which is typically characterized as a one-hot encoded vector. Its component y_c acts as an indicator for class c, taking a value of 1 if and only if the sample belongs to class c. In contrast, ŷ corresponds to the predicted probability distribution output by the model. It satisfies ŷ_c ∈ [0, 1] and Σ_c ŷ_c = 1, where the component ŷ_c denotes the model’s probability estimate for the sample belonging to class c. Here, C signifies the total number of classes, and log denotes the natural logarithm (base e). This function enables the quantitative evaluation of classification performance by measuring the discrepancy between the true distribution y and the predicted distribution ŷ.
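For reference, the conventional form of this loss, consistent with the symbol definitions above (the rendered Equation (13) itself is not reproduced here), is

```latex
\mathcal{L}_{\mathrm{CE}}(y,\hat{y}) \;=\; -\sum_{c=1}^{C} y_{c}\,\log \hat{y}_{c}
```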
All input images were resized to 224 × 224 resolution and converted to 3-channel RGB format. We implemented the model using PyTorch 2.2.1 with CUDA 12.7 acceleration, employing the Adam optimizer at a learning rate of 0.0001. Training was conducted for 100 epochs with a mini-batch size of 16 on a single NVIDIA GeForce RTX 4060 GPU.
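A minimal training-loop sketch reflecting this setup is given below. The dataset path is a placeholder, and torchvision’s ConvNeXt-Tiny stands in for the full FAX-Net model, since the released code is not reproduced here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import convnext_tiny

# Sketch of the training setup described above: Adam, lr = 1e-4,
# batch size 16, 100 epochs, 224x224 3-channel inputs, cross-entropy loss.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),          # ImageFolder's default loader already yields RGB images
])
train_set = datasets.ImageFolder("data/NEU-CLS/train", transform=transform)  # placeholder path
train_loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4)

model = convnext_tiny(weights=None, num_classes=6).to(device)  # stand-in for FAX-Net
criterion = nn.CrossEntropyLoss()                              # Equation (13)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(100):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```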
4. Experiments
In this section, we systematically evaluate the performance of the proposed FAX-Net model on the task of steel surface defect classification through a series of experiments. Section 4.1 introduces the defect dataset used and the evaluation metrics. Section 4.2 analyzes the cross-entropy loss function employed in the model training process. In Section 4.3, we conduct a comparative analysis between FAX-Net and several mainstream classification models, demonstrating its classification accuracy across different defect types. Section 4.4 further verifies the practical contribution of the proposed key modules to performance improvement through ablation studies.
4.1. Defect Dataset and Evaluation Metrics
In this study, the NEU-CLS defect dataset [46] was used as the benchmark for steel surface defect classification experiments. Released by Northeastern University, this dataset contains six typical types of steel surface defects: Crazing (Cr), Inclusions (In), Patches (Pa), Pitted Surface (PS), Rolled-in Scale (RS), and Scratches (Sc), with sample images shown in Figure 3. Each defect category consists of 300 images with a resolution of 200 × 200 pixels. To train and evaluate model performance, the selected defect samples were divided into training and testing sets in a ratio of 8:2.
In this paper, precision, recall, F1-score, and accuracy are adopted as evaluation metrics to assess the performance of the model. These metrics comprehensively reflect the model’s effectiveness from the perspectives of predictive accuracy and coverage capability. Class-wise accuracy is not reported, as it is relatively insensitive to false positives and false negatives and therefore provides limited interpretive value in such tasks. Specifically, precision measures the proportion of true positive samples among those predicted as positive, reflecting the reliability of the model’s positive predictions. Recall measures the proportion of correctly identified positive samples among all actual positive samples, indicating the model’s coverage of positive instances. F1-score is the harmonic mean of precision and recall, balancing both aspects and being particularly suitable for scenarios with imbalanced class distributions. Accuracy evaluates the proportion of correctly predicted samples among all samples, serving as a direct indicator of overall performance. The calculation formulas are shown in Equations (14)–(17).
True positive (TP) refers to the number of samples correctly predicted as positive by the model; false positive (FP) indicates the number of samples that are actually negative but incorrectly predicted as positive; true negative (TN) represents the number of samples correctly identified as negative; and false negative (FN) refers to the number of samples that are actually positive but incorrectly predicted as negative. These four metrics form the foundation for evaluating classification performance and support the calculation and analysis of the composite evaluation metrics such as precision, recall, F1-score, and accuracy.
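The standard definitions corresponding to Equations (14)–(17), consistent with the textual description above, are

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_{1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad
\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}
```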
4.2. Cross-Entropy Loss Analysis
In this paper, the cross-entropy loss function was employed to train the model. This loss function guides the optimization process by quantifying the discrepancy between the predicted class probability distribution and the ground truth labels. As illustrated in Figure 4, the loss value decreased rapidly during the initial training stages, indicating that the model effectively learned discriminative features from the input data. As training progressed, the loss gradually stabilized, suggesting that the model converged toward an optimal solution. Overall, both the training and validation loss curves exhibit consistent downward trends, with no evident signs of overfitting. In the later stages of training, the loss continued to decrease and eventually converged to a relatively low level, further demonstrating the model’s robustness and generalization capability. Although minor fluctuations are observed in some training epochs, the overall loss curve remains smooth, indicating that the training process was stable and the optimization was effective.
4.3. Comparative Experiments
To verify the effectiveness of the proposed FAX-Net model in the task of steel surface defect classification, we conducted systematic comparative experiments on the standard NEU-CLS dataset. The FAX-Net model was compared with several mainstream models, including ResNet50, VGG16, MobileNetV3, Vision Transformer, Swin Transformer, AlexNet, and EfficientNet. In addition, to further validate the effectiveness of the proposed modules, the backbone network used in this study, ConvNeXt, was also included as a baseline model for comparison.
Table 1 presents the classification performance of various models across six types of steel surface defects, including precision, recall, and F1-score. The results demonstrate that FAX-Net performed exceptionally well across all evaluation metrics, achieving an average precision of 97.42%, an average recall of 97.67%, and a high average F1-score of 97.51%. Compared to other mainstream models, FAX-Net shows more consistent and comprehensive recognition capabilities across all defect categories, highlighting its superior feature extraction and discrimination abilities.
Furthermore, as shown in Figure 5, FAX-Net attained the highest overall classification accuracy of 97.78%. By comparison, the next best-performing model, EfficientNet, achieved an accuracy of 88.70% and an F1-score of 88.63%, both substantially lower than those of FAX-Net. These results strongly confirm the significant advantage of the proposed model in steel surface defect classification.
In addition, FAX-Net demonstrated excellent recognition performance across all six categories of steel surface defects, with the best performance observed on the RS class, where both precision and recall reached 100%, and the F1-score was 97.30%. This result indicates that the proposed model possesses superior feature extraction and discrimination capabilities, especially when dealing with complex textures and easily confusable defect categories. It also highlights the model’s enhanced ability to identify subtle defects and samples with blurred boundaries.
4.4. Ablation Experiments
To investigate the impact of different modules on classification performance, we conducted ablation studies by individually incorporating the SDAM and TF-FPN module into the ConvNeXt backbone. The experimental results are presented in Table 2 and Figure 6, where ↑ indicates an improvement in the corresponding metric relative to the baseline model. Under the influence of these modules, our model achieved significant improvements across various evaluation metrics compared to the baseline. Specifically, the introduction of the SDAM led to notable performance gains for several defect types, particularly for the challenging categories Sc and In. The F1-score for Sc improved from 72.53% to 92.31%, and for In, it increased from 82.19% to 92.68%, achieving absolute gains of 19.78% and 10.49%, respectively. These results indicate that the SDAM effectively enhances the model’s response to key local regions in the feature maps, improving its discriminative ability for low-contrast and complex-texture defects such as inclusions.
The introduction of the TF-FPN module enables the model to more effectively recognize defect types with large scale variations and complex morphologies. It shows clear advantages particularly in categories such as Sc and PS, which are characterized by blurred local textures and indistinct boundaries. For the Sc category specifically, the F1-score increased from 72.53% (baseline) to 90.20%, representing an improvement of 17.67%. This significantly enhances the model’s discriminative capability for difficult-to-classify defects. Furthermore, TF-FPN effectively optimizes feature fusion for elongated edge features (e.g., Sc) and multi-scale defects, compensating for the limitations of traditional convolutional networks in modeling geometric features.
To assess the impact of the SDAM on defect-specific attention, Grad-CAM was employed to visualize the model’s focus regions, as illustrated in Figure 7. The results reveal that, for linear defects such as Cr and Sc, the model primarily attends to elongated edge structures; for region-based defects like In and Pa, attention is concentrated on the core areas with pronounced intensity variation; and for texture-complex defects such as PS and RS, the model highlights the transitional zones between defect and background. These observations confirm that the SDAM effectively guides the model to focus on the most informative regions based on the nature of the defect, thereby improving both classification accuracy and interpretability.
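For readers wishing to reproduce this analysis, a minimal Grad-CAM sketch using the third-party pytorch-grad-cam package is shown below; because the trained FAX-Net weights are not public, torchvision’s ConvNeXt-Tiny is used as a stand-in and the chosen target layer is an assumption.

```python
import torch
from torchvision.models import convnext_tiny
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Grad-CAM sketch: visualize which regions drive a given class prediction.
# convnext_tiny stands in for the trained FAX-Net; the last backbone stage is
# used as the target layer (an assumption; adapt both to the actual model).
model = convnext_tiny(weights=None, num_classes=6).eval()
target_layers = [model.features[-1]]

cam = GradCAM(model=model, target_layers=target_layers)
image = torch.randn(1, 3, 224, 224)                 # one preprocessed defect image
heatmap = cam(input_tensor=image,
              targets=[ClassifierOutputTarget(0)])  # class index, e.g. Cr = 0
print(heatmap.shape)                                # (1, 224, 224) saliency map
```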
To further validate the effectiveness of the symmetric design in defect region modeling, we visualized the F1-score heatmaps for different models across each defect category, as shown in Figure 8. The classification performance of each model on different defect types can be intuitively compared through the color intensity. It can be clearly observed that, compared with the CBAM, the SDAM achieved performance improvements across most categories, with particularly significant gains in challenging and easily confusable classes such as Sc, In, and Cr. These improvements demonstrate that the symmetric parallel attention structure adopted by the SDAM can more effectively and evenly model salient information across both channel and spatial dimensions, thereby enhancing the model’s perception of complex structures and blurred-edge defects. Compared with the CBAM’s sequential modeling strategy, the SDAM maintains structural symmetry and independence between the channel and spatial attention branches, enabling the attention mechanism to capture multi-dimensional information more evenly and avoiding the interference caused by sequential bias, highlighting the advantages of the symmetric design in attention mechanisms.
5. Conclusions
We propose an improved model for steel surface defect classification, named FAX-Net, which is built upon the ConvNeXt backbone and integrates the SDAM and TF-FPN, thereby enhancing both the discriminative power and the completeness of feature representation. Comparative and ablation experiments on the NEU-CLS dataset demonstrate that FAX-Net outperforms existing mainstream methods across all classification metrics, confirming its effectiveness in defect classification tasks. Notably, FAX-Net exhibits excellent generalization ability when dealing with structurally complex and scale-varying defect types.
In the future, we will focus on the lightweight design and deployment optimization of the model. To meet the demands of real-time defect detection in industrial scenarios, subsequent work will explore techniques such as structural pruning, knowledge distillation, and quantization to reduce model parameters and computational overhead. Additionally, we will investigate the integration of graph optimization and efficient inference engines tailored for edge hardware platforms such as embedded GPUs, ARM processors, and FPGAs, aiming to enhance deployment efficiency and inference speed in resource-constrained environments. Concurrently, the generalizability of FAX-Net will be rigorously evaluated on additional steel defect datasets, such as extended NEU variants and domain-specific private datasets. Its applicability across diverse industrial scenarios, including cold-rolled steel plate and special alloy production lines, will be assessed to further demonstrate robustness and versatility across varying defect morphologies and manufacturing contexts.