Article

MADC-Net: Densely Connected Network with Multi-Attention for Metal Surface Defect Segmentation

by Xiaokang Ding 1, Xiaoliang Jiang 1,* and Sheng Wang 2,*
1 College of Mechanical Engineering, Quzhou University, Quzhou 324000, China
2 Department of Mechanical Engineering, Quzhou College of Technology, Quzhou 324000, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(4), 518; https://doi.org/10.3390/sym17040518
Submission received: 7 March 2025 / Revised: 24 March 2025 / Accepted: 28 March 2025 / Published: 29 March 2025
(This article belongs to the Special Issue Symmetry and Its Applications in Image Processing)

Abstract

The quality of metal products plays a crucial role in determining their overall performance, reliability and safety. Therefore, timely and effective detection of metal surface defects is of great significance. For this purpose, we present a densely connected network with multi-attention for metal surface defect segmentation, called MADC-Net. Firstly, we selected ResNet50 as the encoder due to its robust performance. To capture richer contextual information from the defect feature map, we designed a densely connected network and incorporated the multi-attention of a CESConv module, an efficient channel attention module (ECAM), and a simple attention module (SimAM) into the decoder. In addition, in the final stage of the decoder, we introduced a reconfigurable efficient attention module (REAM) to reduce redundant calculations and enhance the detection of complex defect structures. Finally, a series of comprehensive comparative and ablation experiments were conducted on the publicly available SD-saliency-900 dataset and our self-constructed Bearing dataset, all of which validated that our proposed method was effective and reliable in defect segmentation. Specifically, the Dice and Jaccard scores for the SD-saliency-900 dataset were 88.82% and 79.96%. In comparison, for the Bearing dataset, the Dice score was 78.24% and the Jaccard score was 64.74%.

1. Introduction

As important industrial raw materials, metal products play a vital role in many industries, especially in aerospace, defense, automotive manufacturing, and light industry. However, due to many uncertainties regarding raw materials, manufacturing processes, and casting environments, defects such as cracks, oxide scale, inclusions, and pores can occur on the surface of metal products. These defects not only affect the appearance of metal products, but may also reduce their service life and durability and even cause serious safety hazards in some applications. Therefore, timely and accurate detection of metal surface defects becomes particularly important. Effective inspection not only reduces production costs and improves production efficiency, but also ensures the reliability and safety of metal products.
In the past, identifying surface defects of metal products usually relied on manual sampling detection, which has many obvious shortcomings, such as a high labor cost, low detection efficiency, and insufficient detection accuracy. Therefore, with the improvement of production demand and industrial automation levels, automatic surface defect detection technology has gradually received extensive attention and research. At present, machine vision has become the mainstream technology in the field of surface defect detection. It adopts a non-contact measurement method, which avoids the problem of contact friction in traditional detection methods and can provide higher detection accuracy. More importantly, the stability and durability of a machine vision system enable it to operate in harsh production environments for long periods of time.
With the rapid development and continuous innovation of deep learning algorithms [1,2,3,4], their application in the field of surface defect detection of metal products has become increasingly widespread. By simulating the processing of the human brain, deep learning can extract complex features from large amounts of image data, and it can effectively identify subtle defects on metal surfaces. Song et al. [5] introduced a cross-granularity approach to enhance a model’s ability to accurately segment defects. This method focuses on labeling defect datasets with a coarse-grained approach, which typically consists of annotations that provide a relatively low resolution or generalization of defect features. Yu et al. [6] proposed a dynamic inference module coupled with a triplet loss function to enhance the model’s ability to learn and understand the relationships between different classes in the dataset. Zhu et al. [7] proposed a U-shaped structure based on a space channel transformer and designed a global auxiliary information extractor to compensate for boundary features, which is very effective for metal surface defect detection. Zhang et al. [8] introduced a strategy known as the normalized mean square frequency category weighting approach, designed to address the issue of imbalanced defect categories in metal surface defect detection. Niu et al. [9] proposed a method that allows for precise control over both the defect region and the defect intensity in the generated images. This method introduces the concept of a defect direction vector and constructs the defect direction vector in the potential variable space.
Despite the considerable progress made with various methods in the field of metal product defect detection, the application of deep learning algorithms still faces significant challenges. One of the main difficulties is the relatively limited annotated data available for training, which makes it more difficult to generalize effectively. Additionally, the diverse and complex nature of metal defects, including their varying shapes, sizes, and textures, adds another layer of complexity to the detection task. In response to these challenges, researchers have increasingly turned to enhancing the capabilities of deep learning architectures by incorporating advanced techniques, including multi-scale information [10,11], transformers [12,13], dense connections [14,15], and attention mechanisms [16,17], to improve detection accuracy and robustness. Among them, Ning et al. [18] introduced a global attention module to enhance feature extraction by integrating both fine-grained channel attention and fine-grained spatial attention. By combining these two attention mechanisms, the ability to capture intricate defect patterns and subtle variations was significantly improved. Wang et al. [19] introduced a multi-receptive-field dense connection module to enhance feature extraction by capturing both local and contextual information across multiple receptive fields. Garbaz et al. [20] proposed a multi-stage feature extraction module that combines the strengths of convolutional neural networks and transformer architectures to enhance feature representation. Tan et al. [21] developed a multi-scale feature dynamic aggregation module to enhance the extraction and fusion of multi-scale information from the encoder’s output feature maps.
Inspired by the aforementioned models, we propose a densely connected network architecture that integrates the strengths of a CESConv module, an efficient channel attention module, a simple attention module, and a reconfigurable efficient attention module. The primary contributions of our study can be summarized as follows:
(1)
We employ a densely connected encoder–decoder architecture to significantly enhance the network’s feature extraction capability. This structure facilitates the efficient flow of information between layers by establishing direct connections, and it ensures that the characteristics of both the lower and upper layers are effectively utilized.
(2)
We design a multi-attention module that takes full advantage of a CESConv module, an efficient channel attention module, and a simple attention module. By combining these attention mechanisms, the proposed module effectively enhances feature discrimination, leading to improved accuracy and robustness in defect detection tasks.
(3)
We introduce a reconfigurable efficient attention module to optimize computational efficiency while improving the detection of intricate defect structures. This module dynamically adjusts its attention mechanisms based on input variations, ensuring that computational resources are allocated effectively without unnecessary redundancy.

2. Materials and Methods

2.1. Overall Architecture of MADC-Net

As illustrated in Figure 1, the proposed MADC-Net adopts an encoder–decoder architecture to effectively extract and refine multi-scale features for metal surface defect segmentation. Compared to lighter backbones like MobileNetV2 or ResNet18, ResNet50 provides richer feature representations. Conversely, although deeper models like ResNet101 or Vision Transformers can extract more complex features, they significantly increase computational overhead. Based on these considerations, MADC-Net’s encoder adopts ResNet50 as its backbone, which offers a favorable balance between feature richness and computational cost. First, an image with a size of 256 × 256 is fed into the backbone network, ResNet50, which generates four feature maps at different levels. Then, each of the four feature maps is individually passed through the CESConv module to obtain richer context information. To facilitate effective feature reconstruction in the decoder, each CESConv-processed feature undergoes a 3 × 3 convolution operation, followed by Up2, Up4, and Up8 upsampling strategies, progressively restoring all feature maps to the same spatial dimensions as the original input image. In addition, the features of the last layer are subjected to 3 × 3 convolution + ECAM and gnConv + SimAM, and the outputs from these two pathways are iteratively added to the corresponding feature maps from the preceding layers. Similarly, the resulting features are again subjected to 3 × 3 convolution + ECAM and gnConv + SimAM and added to the features of the second and first layers. Subsequently, the two features are added, and the merged feature representation is fed into the REAM, which helps eliminate redundant calculations and enhance critical feature retention. Finally, the refined feature map undergoes a 1 × 1 convolution operation, followed by a sigmoid activation function, to generate the final binary segmentation mask.
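To make the encoder stage concrete, the following minimal sketch shows how four multi-level feature maps can be obtained from a 256 × 256 input; it assumes the standard torchvision ResNet50 rather than the authors' exact configuration:

```python
import torch
import torchvision

# Illustrative sketch: extracting the four multi-level feature maps from a
# standard torchvision ResNet50 backbone (an assumption, not the authors' code).
backbone = torchvision.models.resnet50(weights=None)

def extract_features(x):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    f1 = backbone.layer1(x)   # 64 x 64,  256 channels
    f2 = backbone.layer2(f1)  # 32 x 32,  512 channels
    f3 = backbone.layer3(f2)  # 16 x 16, 1024 channels
    f4 = backbone.layer4(f3)  # 8 x 8,   2048 channels
    return f1, f2, f3, f4

feats = extract_features(torch.randn(1, 3, 256, 256))
```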

2.2. CESConv Module

As shown in Figure 2, the CESConv module is composed of three primary functional units arranged in sequence: a channel reconstruction unit (CRU) [22], an efficient multi-scale attention module (EMAM) [23], and a spatial reconstruction unit (SRU) [22]. Specifically, the CRU is responsible for refining channel-wise features by reorganizing and recalibrating the information flow across different channels. Following this, the EMAM is introduced to strengthen the network’s ability to capture critical regions by incorporating multi-scale attention mechanisms. The SRU focuses on improving spatial feature refinement by preserving structural details and reducing noise interference, which is particularly beneficial for accurately localizing defects in metal surfaces. After passing through these three key units, the processed feature map undergoes a 1 × 1 convolution operation. By integrating these carefully designed components, the CESConv module significantly enhances feature extraction and selection, ultimately leading to improved performance in metal surface defect detection tasks.

2.2.1. Channel Reconstruction Unit

Figure 3 shows the structure of the channel reconstruction unit, which optimizes channel information allocation through split, transform, and fuse operations to enhance feature representation. The specific process is as follows:
Split: The input feature map, which consists of C channels, is split into two distinct components. The first portion, comprising αC channels, is designated for global feature extraction, while the remaining (1 − α)C channels are allocated for local feature computation. Both parts are then compressed by 1 × 1 convolutions to reduce the computational cost and highlight key features.
Transform: The upper-channel feature, X_up, is first processed through group-wise convolution (GWC) and point-wise convolution (PWC). These two convolutional operations extract and enhance distinct spatial and contextual details from the input, and their outputs are subsequently summed to generate a comprehensive merged feature representation, Y1. Meanwhile, the lower-channel feature, X_low, undergoes point-wise convolution, which emphasizes finer hidden details within the shallow layers. To preserve essential low-level information, the transformed feature maps are then concatenated with the original X_low, forming the final representation of the lower stage, denoted as Y2.
Fuse: After the transformation process is completed, the feature maps Y1 and Y2 undergo pooling operations to produce the corresponding outputs, S1 and S2. These pooled representations serve as the foundation for computing the attention weights, β1 and β2, which are derived using the Softmax function. Guided by the feature importance vectors, β1 and β2, the upper-layer feature, Y1, and the lower-layer feature, Y2, are selectively fused along the channel dimension to obtain the channel-wise feature representation, Y.
In summary, the channel reconstruction unit minimizes redundancy in spatial fine feature mappings by efficiently refining feature representations along the channel dimensions. Simultaneously, a cost-effective computing strategy and feature reuse mechanism are used to identify and filter redundant features to optimize the overall feature representation. By balancing expressive feature extraction with computational efficiency, the CRU helps to achieve a more compact yet information-rich feature space, which ultimately improves the effectiveness of the model.
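The following PyTorch sketch illustrates this split–transform–fuse flow. It is an illustration only: the split ratio α, the squeeze ratio, and the group count are assumed values, and the reference implementation in SCConv [22] differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRUSketch(nn.Module):
    """Illustrative sketch of the channel reconstruction unit (split -> transform -> fuse)."""
    def __init__(self, channels, alpha=0.5, squeeze=2, groups=2):
        super().__init__()
        self.up_c = int(alpha * channels)            # alpha*C channels for global features
        self.low_c = channels - self.up_c            # (1 - alpha)*C channels for local features
        up_sq, low_sq = self.up_c // squeeze, self.low_c // squeeze
        self.squeeze_up = nn.Conv2d(self.up_c, up_sq, 1, bias=False)    # 1x1 reduction
        self.squeeze_low = nn.Conv2d(self.low_c, low_sq, 1, bias=False)
        # Transform: group-wise (GWC) and point-wise (PWC) convolutions on the upper branch
        self.gwc = nn.Conv2d(up_sq, channels, 3, padding=1, groups=groups, bias=False)
        self.pwc_up = nn.Conv2d(up_sq, channels, 1, bias=False)
        # Point-wise convolution on the lower branch, concatenated with its own input
        self.pwc_low = nn.Conv2d(low_sq, channels - low_sq, 1, bias=False)

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.up_c, self.low_c], dim=1)
        x_up, x_low = self.squeeze_up(x_up), self.squeeze_low(x_low)
        y1 = self.gwc(x_up) + self.pwc_up(x_up)                  # merged upper feature Y1
        y2 = torch.cat([self.pwc_low(x_low), x_low], dim=1)      # lower feature Y2 keeps shallow detail
        s = torch.stack([F.adaptive_avg_pool2d(y1, 1),
                         F.adaptive_avg_pool2d(y2, 1)], dim=0)   # pooled descriptors S1, S2
        beta1, beta2 = torch.softmax(s, dim=0)                   # attention weights via Softmax
        return beta1 * y1 + beta2 * y2                           # channel-wise fusion -> Y
```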

2.2.2. Efficient Multi-Scale Attention Module

The incorporation of a parallel substructure in the network architecture helps mitigate excessive sequential processing and prevents an excessively deep network. Building upon this foundation, we introduced the efficient multi-scale attention module from [23], with its structural details illustrated in Figure 4. This module leverages several key enhancements to optimize feature representation. Specifically, cross-spatial learning is integrated to capture long-range dependencies by modeling interactions across different spatial positions, while group-wise processing segments channels into smaller groups, reducing computational complexity without sacrificing representational power. Additionally, the inclusion of a local convolution enhancement (3 × 3 Conv) strengthens the extraction of fine-grained spatial details, and the adoption of stronger normalization strategies such as GroupNorm ensures improved stability across different batch sizes. Furthermore, fine-tuned attention weight adjustments enhance the module’s ability to balance global and local feature importance dynamically. Collectively, these designs give the EMAM powerful multi-scale feature modeling capabilities, making it an efficient attention mechanism with strong training stability.

2.2.3. Spatial Reconstruction Unit

To effectively leverage the spatial redundancy present in feature representations, we introduced a spatial reconstruction unit [22], as illustrated in Figure 5. This unit operates through a combined separate-and-reconstruct mechanism, ensuring that meaningful spatial structures are preserved while redundant information is minimized. Specifically, the process begins with an intermediate feature map, where a trainable group normalization (GN) layer is applied to assess spatial pixel variance across different batches and channels. Next, the feature map undergoes a reweighting process, where its values are transformed through a sigmoid activation function, mapping them into a range of (0, 1). These values are then filtered using a gating mechanism with a predefined threshold, resulting in two distinct sets of information weights, denoted as W1 and W2. Finally, the original input feature map is multiplied by these weight matrices, producing two complementary outputs: X1W and X2W. To minimize spatial redundancy and enhance feature efficiency, we introduced a cross-reconstruction operation that strategically combines features with varying information densities. This process involves integrating high-information features with those containing relatively less information to produce a more enriched and expressive feature representation. By this means, the method effectively reinforces critical spatial details while suppressing redundant or less meaningful components. As a result, the SRU not only maximizes the retention of valuable information but also optimizes memory usage, ensuring a more compact yet informative feature space.
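A minimal sketch of this separate-and-reconstruct flow is given below; the gating threshold of 0.5 and the group count are assumptions for illustration, not the authors' exact settings:

```python
import torch
import torch.nn as nn

class SRUSketch(nn.Module):
    """Illustrative sketch of the spatial reconstruction unit (separate -> reconstruct)."""
    def __init__(self, channels, groups=4, threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x):
        gn_x = self.gn(x)
        # Per-channel importance from the learnable GN scale, mapped to (0, 1)
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        weights = torch.sigmoid(gn_x * w_gamma)
        w1 = torch.where(weights >= self.threshold, weights, torch.zeros_like(weights))
        w2 = torch.where(weights < self.threshold, weights, torch.zeros_like(weights))
        x1, x2 = w1 * x, w2 * x                    # informative vs. less-informative features
        # Cross-reconstruction: mix the two streams, then concatenate along channels
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)
```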

2.3. Multi-Attention Module

2.3.1. Efficient Channel Attention Module

Figure 6 illustrates the architecture of the efficient channel attention module [24], which is designed to enhance channel-wise feature representation while maintaining computational efficiency. Specifically, the process begins with global average pooling (GAP), which aggregates spatial information by computing the mean value across all spatial positions for each channel. To further improve the module’s adaptability to varying feature scales, an adaptive kernel size selection mechanism is employed. Following this, a 1 × 1 convolution is applied to refine the extracted channel descriptors, reducing dimensionality while preserving essential feature interactions. Then, the output is processed using a sigmoid activation function, which normalizes the computed attention weights to a range between 0 and 1. Finally, the resulting channel attention weight vector is multiplied element by element with the original input features of all the channels. By dynamically adjusting the importance of each channel, ECAM effectively strengthens the network’s ability to model complex feature relationships while maintaining efficiency.
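For reference, the sketch below follows the original ECA-Net formulation [24], in which a 1D convolution of adaptive kernel size over the pooled channel descriptor plays the role of the channel-refinement convolution described above; the kernel-size rule is the one proposed in that paper and is assumed here:

```python
import math
import torch
import torch.nn as nn

class ECASketch(nn.Module):
    """Illustrative sketch of efficient channel attention (ECA-style) [24]."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size derived from the channel count
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                      # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
        w = torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
        return x * w                                # reweight the input channels
```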

2.3.2. gnConv Module

As shown in Figure 7, the gnConv module [25] is an efficient convolutional structure designed to enhance feature extraction capabilities while reducing computational costs. First, the input feature map undergoes a channel expansion projection, increasing the number of channels from C to 2C. This step enriches the feature representation, allowing the network to capture more diverse and expressive patterns. Once the feature dimension is extended, the transformed feature map is fed into a depth-wise convolution (DWConv). This operation efficiently extracts spatial information by processing each channel independently while reducing computational overhead compared to standard convolutions. Following the DWConv, the features are split into multiple subgroups and undergo separate projection operations. Each projected sub-feature set is then subjected to a multiplication (Mul) mechanism, which facilitates the integration of information across different scales. This process ensures that fine-grained spatial details and broader contextual dependencies are effectively combined, leading to a more informative representation. Finally, after multi-layer projection-and-multiplication fusion, an output feature map with C channels is generated. The advantages of the gnConv module are reduced computational complexity and enhanced feature extraction capabilities, meaning that it can be used in a variety of efficient CNN structures, such as lightweight classification networks or detection models.
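A simplified sketch of this recursive gated convolution is shown below. It fixes the interaction order to 3, assumes the channel count is divisible by 4, and omits details of the official HorNet implementation [25]:

```python
import torch
import torch.nn as nn

class GnConvSketch(nn.Module):
    """Simplified sketch of recursive gated convolution (gnConv [25]), order = 3."""
    def __init__(self, dim, order=3):
        super().__init__()
        # Channel splits double at each order: dim/4, dim/2, dim (for order = 3)
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)            # C -> 2C channel expansion
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), 7,
                                padding=3, groups=sum(self.dims))  # depth-wise convolution
        self.projs = nn.ModuleList(
            [nn.Conv2d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1)])
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        gate, feats = torch.split(self.proj_in(x), [self.dims[0], sum(self.dims)], dim=1)
        feats = torch.split(self.dwconv(feats), self.dims, dim=1)
        y = gate * feats[0]                                   # first-order interaction
        for i, proj in enumerate(self.projs):
            y = proj(y) * feats[i + 1]                        # higher-order projection + Mul
        return self.proj_out(y)                               # back to C channels
```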

2.3.3. SimAM

SimAM is a computationally efficient and lightweight attention mechanism designed to enhance feature representation without introducing extra learnable parameters. By leveraging statistical properties such as variance and mean, SimAM effectively assigns importance weights to different spatial locations, improving the expressiveness of feature maps. A detailed PyTorch 2.0 implementation of SimAM can be found in [26].
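For completeness, a minimal sketch following the energy-based formulation of SimAM [26] is given below; the regularization constant λ is an assumed default:

```python
import torch

def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM attention: weight each position by its inverse energy."""
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2       # squared deviation from channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n               # per-channel variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5                 # inverse energy per position
    return x * torch.sigmoid(e_inv)                        # reweight by importance
```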

2.4. Reconfigurable Efficient Attention Module

Figure 8 illustrates the architecture of the reconfigurable efficient attention module, which effectively combines gnConv, SimAM, Strip Pooling, and 3 × 3 convolution to capture both local and global features while optimizing computational efficiency. The input feature map is first split into two processing pathways. In the first path, the features are passed through gnConv, which performs feature transformation and channel-wise information extraction. Meanwhile, in the second path, SimAM is utilized to compute attention weights in a parameter-free manner. The outputs from these two paths are then multiplied element-wise, allowing the feature maps to be selectively enhanced based on their computed importance. After that, the processed features are fed into Strip Pooling [27], a mechanism designed to capture long-range context dependencies, and the result is added to the intermediate features from the previous stage. Next, the fused feature map is further refined using gnConv, which enhances the feature representation by performing additional transformations. To preserve the original characteristics of the input, these processed features are added back to the initial input feature map, allowing residual learning to contribute to a more robust output. Finally, a 3 × 3 convolution is applied to extract fine-grained details and further refine the feature representation. As a result, the REAM outputs an enhanced feature map that is not only more expressive and discriminative but also computationally efficient, making it well-suited for high-performance vision tasks.
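The sketch below assembles this data flow, reusing the GnConvSketch and simam sketches from the previous subsections together with a simplified strip-pooling block; it illustrates the described flow under those assumptions rather than reproducing the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPoolingSketch(nn.Module):
    """Simplified strip pooling [27]: horizontal and vertical 1D context gating."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.conv_w = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        ph = self.conv_h(F.adaptive_avg_pool2d(x, (1, w)))   # pool over height -> (B, C, 1, W)
        pw = self.conv_w(F.adaptive_avg_pool2d(x, (h, 1)))   # pool over width  -> (B, C, H, 1)
        ctx = ph.expand(-1, -1, h, -1) + pw.expand(-1, -1, -1, w)
        return x * torch.sigmoid(self.fuse(ctx))

class REAMSketch(nn.Module):
    """Illustrative sketch of the REAM data flow, reusing GnConvSketch and simam above."""
    def __init__(self, channels):
        super().__init__()
        self.gn1 = GnConvSketch(channels)
        self.gn2 = GnConvSketch(channels)
        self.strip = StripPoolingSketch(channels)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.gn1(x) * simam(x)          # element-wise fusion of the two pathways
        y = self.strip(y) + y               # long-range context added to intermediate features
        y = self.gn2(y) + x                 # further refinement plus residual connection
        return self.out_conv(y)             # 3 x 3 convolution for fine-grained details
```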

2.5. Loss Function

Unlike traditional loss functions, Dice loss [28,29] offers a more robust optimization mechanism by emphasizing the degree of overlap between the predicted region and the actual ground truth. Therefore, we adopted Dice as our loss criterion, which is computed as follows:
L_{dice}(y, p) = 1 - \frac{2\sum_{i=1}^{N} p_i y_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i}
where N is the number of pixels, p_i is the predicted value for pixel i, and y_i is the corresponding ground-truth label. A comparison of different loss functions with MADC-Net is shown in Table 1.
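A direct implementation of this loss is straightforward; the small smoothing constant below is an assumption added to avoid division by zero on empty masks:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss as in the equation above, averaged over the batch."""
    pred = pred.flatten(1)                       # (B, N) predicted probabilities
    target = target.flatten(1)                   # (B, N) binary ground truth
    intersection = (pred * target).sum(dim=1)
    denom = pred.sum(dim=1) + target.sum(dim=1)
    return (1 - (2 * intersection + eps) / (denom + eps)).mean()
```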

3. Results

3.1. Dataset

SD-saliency-900 dataset: The SD-saliency-900 dataset is a publicly available dataset for segmenting strip defects. It encompasses a diverse range of defect types, including inclusions, patches, and scratches, making it highly suitable for evaluating the robustness of defect segmentation models. The dataset comprises a total of 900 images, which are systematically divided into three subsets: 540 images for training, 180 images for validation, and 180 images for testing. Figure 9 illustrates representative samples from the SD-saliency-900 dataset, which is accessible for research and experimentation purposes via the source provided in [30].
Bearing dataset: The Bearing dataset is a private collection designed for defect segmentation and includes various types of surface defects, such as wear, grooves, and scratches. It contains a total of 4758 images, which are divided into three subsets: 2854 for training, 952 for validation, and 952 for testing. Figure 10 shows representative samples of the dataset generated and analyzed for this study. Although it is not publicly available, access can be granted by the original author upon reasonable request [31].

3.2. Evaluation Metrics

To comprehensively evaluate the performance of MADC-Net, we employed two widely used evaluation metrics: the Dice coefficient [32,33] and the Jaccard index [34,35]. Their precise mathematical formulations are given as follows:
Dice = \frac{2TP}{2TP + FN + FP}
Jaccard = \frac{TP}{TP + FN + FP}
where TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative pixels, respectively.
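Both metrics can be computed directly from a pair of binary masks, as in the following sketch:

```python
import numpy as np

def dice_jaccard(pred_mask, gt_mask):
    """Dice and Jaccard from the confusion counts in the equations above.
    Both inputs are assumed to be binary arrays of the same shape."""
    pred_mask, gt_mask = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    dice = 2 * tp / (2 * tp + fn + fp) if (2 * tp + fn + fp) else 1.0
    jaccard = tp / (tp + fn + fp) if (tp + fn + fp) else 1.0
    return dice, jaccard
```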

3.3. Implementation Details

The implementation of MADC-Net was carried out using the PyTorch framework and trained on a high-performance workstation equipped with an NVIDIA RTX A6000 GPU, which featured 48 GB of memory. For the training process, the batch size was set to 32, the learning rate was initialized at 0.001, and the total number of training epochs was set to 200. The model parameters were updated using the Adam optimizer, selected for its adaptive learning rates and its ability to effectively handle sparse gradients. To maintain consistency in input size and enhance network compatibility, all images were resized to a fixed resolution of 256 × 256 pixels. Figure 11 shows the change curves of loss and accuracy for MADC-Net on the SD-saliency-900 and Bearing datasets. For the SD-saliency-900 dataset, the training loss and validation loss fluctuated greatly in the first 25 epochs, which may have been due to the instability of the model in the process of adapting to the data distribution. As the training progressed, the loss levelled off and eventually remained at a low level, indicating that the model converged. Both the training accuracy and validation accuracy rose rapidly during the first few epochs and levelled off after the first 25 epochs. The training accuracy was finally close to 0.99, while the validation accuracy was slightly lower, about 0.98, indicating that the model had achieved a high segmentation performance on the SD-saliency-900 dataset. For the Bearing dataset, the decline trend of the training loss and validation loss was similar to that for SD-saliency-900, with a rapid initial decline but a relatively lower overall loss value, indicating that MADC-Net has a better fitting ability on this dataset. Both the training accuracy and validation accuracy rose rapidly within the first 10 epochs and approached 1.0 after 50 epochs, indicating that the model learned the segmentation task on this dataset faster.
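A minimal training-loop sketch matching these settings is shown below; the MADCNet class and the training dataset are hypothetical placeholders assumed to be defined elsewhere, and dice_loss refers to the sketch given in Section 2.5:

```python
import torch
from torch.utils.data import DataLoader

# Training-setup sketch with the stated hyper-parameters; MADCNet and train_dataset
# are placeholders for the model and dataset, which are not reproduced here.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MADCNet().to(device)                                   # hypothetical model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # Adam, lr = 0.001
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

for epoch in range(200):                                       # 200 training epochs
    model.train()
    for images, masks in train_loader:                         # images resized to 256 x 256
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = dice_loss(torch.sigmoid(model(images)), masks)  # Dice loss on sigmoid outputs
        loss.backward()
        optimizer.step()
```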

3.4. Ablation Experiments

To systematically evaluate the impact of each component within MADC-Net, we conducted a series of ablation experiments on the SD-saliency-900 dataset, with the quantitative results summarized in Table 2. The Baseline was first established by constructing a densely connected network integrated with ECAM, gnConv, and SimAM. This initial configuration achieved a Dice of 87.65% and a Jaccard of 78.10%, serving as a reference for subsequent enhancements. To assess the effectiveness of CESConv, we incorporated it into the Baseline, increasing the Dice to 87.95% and the Jaccard to 78.55%. This enhancement indicates that CESConv contributes to better feature extraction and representation. Similarly, we evaluated the influence of REAM by adding it independently to the Baseline. The results showed a Dice of 88.04% and a Jaccard of 78.72%, which indicates that REAM can effectively refine the feature learning and improve the segmentation accuracy. Finally, we explored the combined effect of CESConv and REAM by integrating both into the Baseline. This configuration achieved the best performance, with the Dice reaching 88.82% and the Jaccard improving to 79.96%. These findings confirm that each module positively impacts segmentation performance, with the most significant gains observed when CESConv and REAM are used together. Meanwhile, looking at the values of the number of parameters (in millions), frames per second (FPS), and Giga floating-point operations (GFLOPs) in Table 2, we found that incorporating CESConv and REAM into the Baseline resulted in only a marginal increase in the total parameter count. Additionally, their impact on computational efficiency remained limited, with the FPS and GFLOPs staying within an acceptable range. This indicates that the modifications introduced do not substantially affect the model’s complexity or processing speed while enhancing its performance.
Figure 12 provides a visual comparison of the segmentation results obtained from the ablation study conducted on the SD-saliency-900 dataset. The first row represents the original images containing various strip defects, such as inclusions, scratches, and patches, and the second row shows their annotated images. The third row shows the results of Baseline, which is a densely connected network of ECAM, gnConv, and SimAM. This model captured general defect areas, but there were significant inconsistencies and missing details, especially in complex defect patterns. The fourth row shows the segmentation results of Baseline + CESConv, which slightly improved the segmentation performance by including CESConv in Baseline. The detected defect boundaries became finer, and smaller defect areas were captured more efficiently. However, some minor defects remained unsegmented. The fifth row shows the segmentation results for Baseline + REAM, and the segmentation accuracy was further improved when REAM was added to Baseline alone. However, some complex defect structures remained incomplete. The sixth row shows the segmentation results of Baseline + CESConv + REAM. The segmentation effect was best when CESConv and REAM were combined. This confirms that CESConv and REAM work together to enhance feature extraction and defect representation.

3.5. Comparison Experiments

To thoroughly assess the segmentation capabilities of the various models on the SD-saliency-900 dataset, we conducted a series of comparative experiments, with the results summarized in Table 3. These methods included LW-IRSTNet [36], LOANet [37], FastICENet [38], DGNet [39], CMUNeXt [40], BCS-Net [41], and A-Net [42]. Among all the models, BCS-Net exhibited the lowest segmentation performance, achieving a Dice of 86.58% and a Jaccard of 76.45%. While it demonstrates some ability to segment defect regions, its accuracy is relatively limited. LOANet and FastICENet showed slightly better outcomes than BCS-Net, with Dice scores of 87.35% and 87.37% and Jaccard scores of 77.61% and 77.66%, respectively. Although these models exhibit moderate segmentation accuracy, their performance in capturing intricate defect structures and maintaining precise boundary delineation remains suboptimal. DGNet achieved a Dice of 87.48% and a Jaccard of 77.80%, slightly ahead of the previous models, indicating that it is capable of fine segmentation, although it still did not achieve top performance. A-Net achieved a commendable performance, with a Dice of 87.92% and a Jaccard of 78.51%. It demonstrates high accuracy in segmenting defect regions, but there is still potential for enhancing its ability to capture complex structures and refine object boundaries for more precise segmentation results. On the other hand, CMUNeXt and LW-IRSTNet delivered strong segmentation results, attaining Dice scores of 88.31% and 88.38% and Jaccard scores of 79.13% and 79.24%, respectively. However, their segmentation masks may still include minor misclassifications, leaving room for further optimization. Among all the models, MADC-Net emerged as the best-performing approach, achieving the highest Dice of 88.82% and Jaccard of 79.96%. The superior performance of MADC-Net highlights its ability to effectively capture both local and global structural features, enabling more accurate segmentation of complex defect patterns.
Figure 13 presents the qualitative segmentation results of these models, with the first and second rows showing the original images and corresponding ground-truth masks and the remaining rows illustrating the predictions of each model. From a visual perspective, LOANet and FastICENet demonstrate a commendable ability to capture the overall structure of defective regions. However, their segmentations occasionally lack precision in finer details, leading to minor inconsistencies when compared to the ground truth. On the other hand, DGNet produces relatively complete segmentations, effectively identifying defective areas. Despite this strength, it tends to over-segment, resulting in an increased number of false positives, which could misrepresent the true extent of defects. In contrast, LW-IRSTNet and MADC-Net stand out for their exceptional segmentation accuracy. Their outputs closely align with the ground truth, displaying well-defined boundaries and minimal background interference.
To evaluate the segmentation performance of the various models on the Bearing dataset, we conducted a comparative analysis of both quantitative and qualitative results. Table 4 presents the segmentation accuracy of different methods in terms of Dice and Jaccard metrics, while Figure 14 visually illustrates the segmentation outcomes. From a qualitative perspective, MADC-Net and LOANet produced the most accurate segmentations, effectively capturing the shape and structure of the defective regions with minimal noise and high consistency with the ground truth. This aligns with their high Dice (0.7824 and 0.7805) and Jaccard (0.6474 and 0.6449) scores, as reported in Table 4. The defect boundaries in their segmentations are well-defined, and they show strong resistance to false detections. CMUNeXt also performed competitively, yielding segmentations that maintained a good balance between capturing the defects and suppressing background noise. This performance was reflected in its Dice of 0.7729 and Jaccard of 0.6359. However, in some cases, CMUNeXt slightly under-segmented certain defective areas, leading to minor discrepancies. FastICENet and DGNet demonstrated moderate performance, with Dices of 0.7584 and 0.7595 and Jaccards of 0.6166 and 0.6183, respectively. Their segmentations showed reasonable accuracy but occasionally missed smaller defect regions or introduced minor false positives, leading to slight deviations from the ground-truth masks. LW-IRSTNet, BCS-Net, and A-Net exhibited the lowest segmentation performance, with a Dice around 0.75 and a Jaccard near 0.61. These models struggle with detecting finer defect structures and sometimes fail to distinguish between defective and non-defective regions, leading to less precise segmentations. Overall, the visual outcomes in Figure 14 align well with the quantitative evaluation in Table 4, emphasizing the strengths of MADC-Net in achieving high segmentation accuracy.

3.6. Computational Efficiency

To evaluate the computational efficiency of various methods on the SD-saliency-900 dataset, we analyzed the number of parameters (Params), frames per second (FPS), and Giga floating-point operations (GFLOPs), as presented in Table 5. Among the evaluated models, FastICENet achieved the highest FPS (163.98) with a relatively low GFLOPs (0.5632), indicating high efficiency. Similarly, LW-IRSTNet had the smallest number of parameters (0.1612 M) and low GFLOPs (0.3014), while still maintaining a decent FPS (70.38). CMUNeXt had a significantly higher GFLOPs (7.4177) than the others but maintained a high FPS (146.41), suggesting optimized computation despite its complexity. BCS-Net had the highest parameter count (44.8226 M) and GFLOPs (14.8458), leading to the lowest FPS (12.40), indicating high computational cost and slower inference. With 0.5496 M parameters and 1.5075 GFLOPs, MADC-Net’s FPS (52.39) was moderate but lower than that of FastICENet, which implies a trade-off between complexity and efficiency.

3.7. Limitations

Although MADC-Net demonstrates strong performance in most scenarios, it encounters challenges with certain images, as illustrated in Figure 15. Specifically, when the defect region closely resembles the surrounding background in terms of gray-level intensity or is significantly influenced by variations in lighting conditions, the model struggles to accurately differentiate between defective and normal areas. To address these limitations, future research will integrate contrastive learning strategies, which focus on learning the discriminative distance between defective and non-defective regions. This approach is expected to improve the model’s ability to identify minor defects and enhance its robustness under varying imaging conditions.

4. Conclusions

In this study, we introduce MADC-Net, a densely connected network with multi-attention mechanisms designed for metal surface defect segmentation. By leveraging ResNet50 as the encoder, MADC-Net effectively extracts hierarchical features while maintaining a strong balance between computational efficiency and segmentation accuracy. To further enhance defect detection, the network integrates a densely connected decoder equipped with multi-attention modules, including CESConv, ECAM, and SimAM, which work in synergy to refine feature representation and improve segmentation precision. Additionally, the incorporation of the reconfigurable efficient attention module at the final decoding stage optimizes computational efficiency by reducing redundant calculations while enhancing the model’s ability to detect intricate defect structures. Extensive comparative and ablation experiments conducted on both the SD-saliency-900 dataset and the self-constructed Bearing dataset validate the effectiveness and robustness of MADC-Net in defect segmentation. Our method achieved Dice and Jaccard scores of 88.82% and 79.96% on the SD-saliency-900 dataset, while delivering a 78.24% Dice and a 64.74% Jaccard on the Bearing dataset, demonstrating strong generalization capabilities across different datasets. Overall, MADC-Net provides a highly accurate, computationally efficient, and reliable solution for defect segmentation that is excellent in terms of both accuracy and robustness, making it ideal for industrial applications.
Beyond metal surface defect segmentation, MADC-Net can be adapted to a wide range of other defect types and industries. For example, in the electronics industry, MADC-Net could be extended for the detection of defects in printed circuit boards, such as soldering issues, cracks, or faulty components. In automotive manufacturing, the model could be applied to detect surface imperfections on car bodies, tires, or internal components that are critical for vehicle performance and safety. Additionally, MADC-Net’s robust segmentation capabilities could be leveraged in textile production for detecting fabric defects, such as tears, holes, or weaving inconsistencies.

Author Contributions

Conceptualization, X.D. and X.J.; methodology, X.D.; validation, X.J. and S.W.; writing—original draft preparation, X.D. and X.J.; writing—review and editing, X.D.; visualization, X.J.; supervision, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Joint Fund of Zhejiang Provincial Natural Science Foundation of China, grant numbers LZY24E050001 and ZCLTGS24E0601; the National Natural Science Foundation of China, grant number 62102227; and the Science and Technology Major Projects of Quzhou, grant numbers 2023K221 and 2024K191.

Data Availability Statement

SD-saliency-900 is a public dataset, and its link is mentioned in the paper. The Bearing dataset generated and/or analyzed during the current study is not publicly available but may be obtained from the original author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, Y.; Qin, Y.; Lin, Z.; Xia, H.; Wang, C. Detection of scratch defects on metal surfaces based on MSDD-UNet. Electronics 2024, 13, 3241. [Google Scholar] [CrossRef]
  2. García Peña, D.; García Pérez, D.; Díaz Blanco, I.; Juárez, J.M. Exploring deep fully convolutional neural networks for surface defect detection in complex geometries. Int. J. Adv. Manuf. Tech. 2024, 134, 97–111. [Google Scholar]
  3. Zhang, Z.; Wang, W.; Tian, X.; Tan, J. Data-driven semantic segmentation method for detecting metal surface defects. IEEE Sens. J. 2024, 24, 15676–15689. [Google Scholar]
  4. Song, K.; Feng, H.; Cao, T.; Cui, W.; Yan, Y. MFANet: Multifeature aggregation network for cross-granularity few-shot seamless steel tubes surface defect segmentation. IEEE Trans. Ind. Inform. 2024, 20, 9725–9735. [Google Scholar]
  5. Bakirci, M. Vehicular mobility monitoring using remote sensing and deep learning on a UAV-based mobile computing platform. Measurement 2025, 244, 116579. [Google Scholar]
  6. Yu, R.; Guo, B. Dynamic reasoning network for image-level supervised segmentation on metal surface defect. IEEE Trans. Instrum. Meas. 2024, 73, 5018010. [Google Scholar]
  7. Zhu, W.; Liang, R.; Yang, J.; Cao, Y.; Fu, G.; Cao, Y. A sub-region Unet for weak defects segmentation with global information and mask-aware loss. Eng. Appl. Artif. Intel. 2023, 122, 106011. [Google Scholar]
  8. Zhang, Z.; Wang, W.; Tian, X. Semantic segmentation of metal surface defects and corresponding strategies. IEEE Trans. Instrum. Meas. 2023, 72, 5016813. [Google Scholar]
  9. Niu, S.; Li, B.; Wang, X.; Peng, Y. Region-and strength-controllable GAN for defect generation and segmentation in industrial images. IEEE Trans. Ind. Inform. 2022, 18, 4531–4541. [Google Scholar] [CrossRef]
  10. Ma, P.; Wang, G.; Li, T.; Zhao, H.; Li, Y.; Wang, H. STCS-Net: A medical image segmentation network that fully utilizes multi-scale information. Biomed. Opt. Express 2024, 15, 2811–2831. [Google Scholar] [CrossRef]
  11. Wang, Y.; Zhou, Y.; Wu, H.; Liu, X.; Zhai, X.; Sun, K.; Tian, C.; Zhao, H.; Li, T.; Jia, W.; et al. Mfcanet: A road scene segmentation network based on multi-scale feature fusion and context information aggregation. J. Vis. Commun. Image R. 2024, 98, 104055. [Google Scholar] [CrossRef]
  12. Zhang, J.; Ye, Z.; Chen, M.; Yu, J.; Cheng, Y. TransGraphNet: A novel network for medical image segmentation based on transformer and graph convolution. Biomed. Signal Proces. 2025, 104, 107510. [Google Scholar] [CrossRef]
  13. Patil, S.S.; Ramteke, M.; Rathore, A.S. Permutation Invariant self-attention infused U-shaped transformer for medical image segmentation. Neurocomputing 2025, 625, 129577. [Google Scholar] [CrossRef]
  14. Liu, M.; Liu, P.; Zhao, L.; Ma, Y.; Chen, L.; Xu, M. Fast semantic segmentation for remote sensing images with an improved short-term dense-connection (STDC) network. Int. J. Digit. Earth 2024, 17, 2356122. [Google Scholar] [CrossRef]
  15. Zhang, J.; Zhang, Y.; Qiu, H.; Wang, T.; Li, X.; Zhu, S.; Huang, M.; Zhuang, J.; Shi, Y.; Xu, X. Constrained multi-scale dense connections for biomedical image segmentation. Pattern Recogn. 2025, 158, 111031. [Google Scholar] [CrossRef]
  16. Dong, S.; Cao, J.; Wang, Y.; Ma, J.; Kuang, Z.; Zhang, Z. Lightweight multi-scale encoder-decoder network with locally enhanced attention mechanism for concrete crack segmentation. Meas. Sci. Technol. 2025, 36, 025021. [Google Scholar] [CrossRef]
  17. Chen, Q.; Wang, J.; Yin, J.; Yang, Z. CFFANet: Category feature fusion and attention mechanism network for retinal vessel segmentation. Multimed. Syst. 2024, 30, 332. [Google Scholar] [CrossRef]
  18. Ning, G.; Liu, P.; Dai, C.; Sun, M.; Zhou, Q.; Li, Q. RGAM: A refined global attention mechanism for medical image segmentation. IET Comput. Vis. 2024, 18, 1362–1375. [Google Scholar] [CrossRef]
  19. Wang, X.; Cao, W. MRFDCNet: Multireceptive field dense connection network for real-time semantic segmentation. Mob. Inf. Syst. 2022, 2022, 6100292. [Google Scholar] [CrossRef]
  20. Garbaz, A.; Oukdach, Y.; Charfi, S.; El Ansari, M.; Koutti, L.; Salihoun, M.; Lafraxo, S. MLFE-UNet: Multi-level feature extraction transformer-based UNet for gastrointestinal disease segmentation. Int. J. Imag. Syst. Tech. 2025, 35, e70030. [Google Scholar] [CrossRef]
  21. Tan, D.; Yao, Z.; Peng, X.; Ma, H.; Dai, Y.; Su, Y.; Zhong, W. Multi-level medical image segmentation network based on multi-scale and context information fusion strategy. IEEE Trans. Em. Top. Comput. Intell. 2024, 8, 474–487. [Google Scholar]
  22. Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  23. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  25. Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.N.; Lu, J. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural Inf. Process. Syst. 2022, 35, 10353–10366. [Google Scholar]
  26. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  27. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4003–4012. [Google Scholar]
  28. Luo, H.; Zhou, D.M.; Cheng, Y.J.; Wang, S.Q. MPEDA-Net: A lightweight brain tumor segmentation network using multi-perspective extraction and dense attention. Biomed. Signal Process. Control 2024, 91, 106054. [Google Scholar]
  29. Yuan, H.J.; Chen, L.N.; He, X.F. MMUNet: Morphological feature enhancement network for colon cancer segmentation in pathological images. Biomed. Signal Process. Control 2024, 91, 105927. [Google Scholar]
  30. SD-Saliency-900 Dataset. Available online: https://gitee.com/dengzhiguang/EDRNet (accessed on 20 November 2024).
  31. Bearing Dateset. Available online: https://m.tb.cn/h.TXXh4Hf?tk=ls6p3Gamn1C (accessed on 20 November 2024).
  32. Selvaraj, A.; Nithiyaraj, E. CEDRNN: A convolutional encoder-decoder residual neural network for liver tumour segmentation. Neural Process. Lett. 2023, 55, 1605–1624. [Google Scholar]
  33. Li, Y.; Zhang, Y.; Liu, J.Y.; Wang, K.; Zhang, K.; Zhang, G.S.; Liao, X.F.; Yang, G. Global transformer and dual local attention network via deep-shallow hierarchical feature fusion for retinal vessel segmentation. IEEE Trans. Cybern. 2022, 53, 5826–5839. [Google Scholar]
  34. Yang, M.Y.; Shen, Q.L.; Xu, D.T.; Sun, X.L.; Wu, Q.B. Striped WriNet: Automatic wrinkle segmentation based on striped attention module. Biomed. Signal Proces. 2024, 90, 105817. [Google Scholar]
  35. Yang, C.; Li, B.; Xiao, Q.; Bai, Y.; Li, Y.; Li, Z.; Li, H.; Li, H. LA-Net: Layer attention network for 3D-to-2D retinal vessel segmentation in OCTA images. Phys. Med. Biol. 2024, 69, 045019. [Google Scholar]
  36. Kou, R.; Wang, C.; Yu, Y.; Peng, Z.; Yang, M.; Huang, F.; Fu, Q. LW-IRSTNet: Lightweight infrared small target segmentation network and application deployment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5621313. [Google Scholar]
  37. Han, X.; Liu, Y.; Liu, G.; Lin, Y.; Liu, Q. LOANet: A lightweight network using object attention for extracting buildings and roads from UAV aerial remote sensing images. PeerJ Comput. Sci. 2023, 9, e1467. [Google Scholar]
  38. Zhang, X.; Zhao, Z.; Ran, L.; Xing, Y.; Wang, W.; Lan, Z.; Yin, H.; Hee, H.; Liud, Q.; Zhang, B.; et al. FastICENet: A real-time and accurate semantic segmentation model for aerial remote sensing river ice image. Signal Process. 2023, 212, 109150. [Google Scholar] [CrossRef]
  39. Ji, G.P.; Fan, D.P.; Chou, Y.C.; Dai, D.; Liniger, A.; Van Gool, L. Deep gradient learning for efficient camouflaged object detection. Mach. Intell. Res. 2023, 20, 92–108. [Google Scholar] [CrossRef]
  40. Tang, F.; Ding, J.; Quan, Q.; Wang, L.; Ning, C.; Zhou, S.K. CMUNeXt: An efficient medical image segmentation network based on large kernel and skip fusion. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging, Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar]
  41. Cong, R.; Yang, H.; Jiang, Q.; Gao, W.; Li, H.; Wang, C.; Zhao, Y.; Kwong, S. BCS-Net: Boundary, context, and semantic for automatic COVID-19 lung infection segmentation from CT images. IEEE Trans. Instrum. Meas. 2022, 71, 5019011. [Google Scholar] [CrossRef]
  42. Chen, B.; Niu, T.; Yu, W.; Zhang, R.; Wang, Z.; Li, B. A-net: An a-shape lightweight neural network for real-time surface defect segmentation. IEEE Trans. Instrum. Meas. 2023, 73, 5001314. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of MADC-Net.
Figure 2. Structure of CESConv module.
Figure 3. Structure of channel reconstruction unit.
Figure 4. Structure of efficient multi-scale attention module.
Figure 5. Structure of spatial reconstruction unit.
Figure 6. Structure of efficient channel attention module.
Figure 7. Structure of gnConv module.
Figure 8. Structure of reconfigurable efficient attention module.
Figure 9. Representative samples from the SD-saliency-900 dataset and corresponding annotations.
Figure 10. Representative samples from the Bearing dataset and corresponding annotations.
Figure 11. Loss and accuracy curves of MADC-Net. The first is the result for the SD-saliency-900 dataset. The second is the result for the Bearing dataset.
Figure 12. Visualization of ablation experiments on the SD-saliency-900 dataset. The first to last rows show original images, mask images, Baseline, Baseline + CESConv, Baseline + REAM, and Baseline + CESConv + REAM.
Figure 13. Visualization of comparison experiment on the SD-saliency-900 dataset. The first to last rows show original images, mask images, LW-IRSTNet, LOANet, FastICENet, DGNet, CMUNeXt, BCS-Net, A-Net, and MADC-Net.
Figure 14. Visualization of comparison experiment on the Bearing dataset. The first to last rows show original images, mask images, LW-IRSTNet, LOANet, FastICENet, DGNet, CMUNeXt, BCS-Net, A-Net, and MADC-Net.
Figure 15. Failure cases from different datasets. The first to third rows show original images, mask images, and MADC-Net on the SD-saliency-900 dataset. The fourth to last rows show original images, mask images, and MADC-Net on the Bearing dataset.
Table 1. Comparison experiment of different loss functions on the SD-saliency-900 dataset.
Loss Function | Dice | Jaccard
Dice | 0.8882 | 0.7996
Tversky (α = 0.1, β = 0.9) | 0.7582 | 0.6138
Tversky (α = 0.9, β = 0.1) | 0.8617 | 0.7578
IoU | 0.8289 | 0.7091
FocalTversky | 0.8457 | 0.7341
Table 2. Ablation experiments on the SD-saliency-900 dataset.
Method | Dice | Jaccard | Params (M) | FPS | GFLOPs
Baseline | 0.8765 | 0.7810 | 0.3659 | 80.0056 | 1.0105
Baseline + CESConv | 0.8795 | 0.7855 | 0.4778 | 60.1794 | 1.1527
Baseline + REAM | 0.8804 | 0.7872 | 0.4376 | 66.4669 | 1.3653
Baseline + CESConv + REAM | 0.8882 | 0.7996 | 0.5496 | 52.3901 | 1.5075
Table 3. Comparison experiments on the SD-saliency-900 dataset.
Method | Dice | Jaccard
LW-IRSTNet [36] | 0.8838 | 0.7924
LOANet [37] | 0.8735 | 0.7761
FastICENet [38] | 0.8737 | 0.7766
DGNet [39] | 0.8748 | 0.7780
CMUNeXt [40] | 0.8831 | 0.7913
BCS-Net [41] | 0.8658 | 0.7645
A-Net [42] | 0.8792 | 0.7851
MADC-Net | 0.8882 | 0.7996
Table 4. Comparison experiments on the Bearing dataset.
Method | Dice | Jaccard
LW-IRSTNet [36] | 0.7510 | 0.6084
LOANet [37] | 0.7805 | 0.6449
FastICENet [38] | 0.7584 | 0.6166
DGNet [39] | 0.7595 | 0.6183
CMUNeXt [40] | 0.7729 | 0.6359
BCS-Net [41] | 0.7541 | 0.6114
A-Net [42] | 0.7523 | 0.6089
MADC-Net | 0.7824 | 0.6474
Table 5. Parameters and computational efficiency of all methods on the SD-saliency-900 dataset.
Method | Params (M) | FPS | GFLOPs
LW-IRSTNet [36] | 0.1612 | 70.38 | 0.3014
LOANet [37] | 1.3873 | 73.45 | 1.3876
FastICENet [38] | 0.9644 | 163.98 | 0.5632
DGNet [39] | 0.5712 | 49.39 | 0.6628
CMUNeXt [40] | 3.1492 | 146.41 | 7.4177
BCS-Net [41] | 44.8226 | 12.40 | 14.8458
A-Net [42] | 0.3898 | 58.89 | 0.6089
MADC-Net | 0.5496 | 52.39 | 1.5075