1. Introduction
The casting quality of turbine blades is critical for ensuring the safe operation of aircraft engines under extreme operational conditions. Consequently, pre-service defect inspection is crucial for maintaining engine integrity and flight safety. Radiographic inspection, as shown in Figure 1, is widely employed to detect potential defects within casting blades owing to its non-invasive nature. However, the current inspection process depends on expert knowledge and experience, leading to inefficient and subjective assessments. This underscores the urgent need for automated defect detection in radiography images of casting blades.
Recently, numerous researchers have employed deep learning-based methods for turbine blade inspection [1,2,3,4,5] and other industrial defect detection fields [6,7,8,9]. For instance, Liu et al. [1] proposed an attention mechanism-enhanced feature fusion network based on You Only Look Once (YOLO) for wind turbine blade surface defect detection, which employs a bidirectional feature pyramid and a classification loss function with an attenuation factor to address sample imbalance. Qi et al. [2] developed a semi-supervised framework that integrates three parallel self-supervised mechanisms with a vision transformer backbone for aero-engine blade defect segmentation, which employs patch-level distortions and a class-centric loss to mitigate class imbalances with limited labeled samples. To address the challenges of tiny defects and weak features in aero-engine blade inspection, Ma et al. [3] introduced SPDP-Net, which leverages semantic prior mining to capture pixel-level defect location priors and employs a defect enhancement perception module to separate weak defects from complex backgrounds. Sun et al. [4] introduced the Surface Defect Detection-Detection Transformer (SDD-DETR) by applying the detection transformer architecture to aero-engine blade surface defect detection, which employs a multi-scale deformable attention module and a lightweight feed-forward network to reduce computational complexity while maintaining high detection accuracy. Wang et al. [5] employed a pyramid vision transformer with spatial-cross overlapping embedding and dual attention mechanisms to capture fuzzy boundary features and model long-range spatial interactions for casting defect detection in radiographic images with low visibility. Beyond turbine blade inspection, deep learning techniques have also been extensively applied to other industrial defect detection domains. Liu et al. [6] developed a real-time anchor-free detector with global and local feature enhancement modules to address complex backgrounds and small-size defects, while incorporating a box refinement module to capture multi-scale defect shapes. Ma et al. [7] introduced the Efficient Linear Attention (ELA) model based on YOLOv8, which incorporates linear attention mechanisms and a selective feature pyramid network to balance detection accuracy and computational efficiency for steel surface defect detection under complex industrial conditions. Yang et al. [8] proposed a lightweight detection model for internal tunnel lining defects in ground-penetrating radar images, which employs a dual attention module to achieve efficient defect recognition with reduced model size. For metal surface defect segmentation, Li et al. [9] proposed the Cross-Dimensional Adaptive Region reconstruction Network (CDARNet) with a cross-dimensional adaptive region reconstruction module to suppress background noise and a contextual self-correlation semantic unification module to enhance defect localization across multiple scales.
Generally, these approaches can be categorized into two primary network architectures: detection networks and segmentation networks. Detection networks, such as the Region Convolutional Neural Network (R-CNN) series [10], the YOLO series [11], and the DETR series [12], generate bounding boxes for defect localization and classification. In contrast, segmentation networks, including the Fully Convolutional Network (FCN) series [13] and the U-Net series [14], perform pixel-wise classification of defect regions, enabling precise delineation of defect location and shape. Thus, segmentation networks are increasingly utilized for industrial defect inspection tasks. Despite these advances, automated defect segmentation in blade casting faces three major challenges that significantly affect segmentation accuracy:
- (1) Poor image quality inherent in industrial radiography, characterized by low signal-to-noise ratios, uneven illumination, and imaging artifacts, obscures defect boundaries and compromises reliable feature extraction.
- (2) Defects exhibit both intraclass variance and interclass similarity: the same defect type may appear vastly different under varying casting conditions, while distinct defect categories may share similar textural characteristics in challenging imaging scenarios.
- (3) Irregular defect geometries introduce complex morphological variations, including arbitrary shapes, non-uniform size distributions, and unpredictable spatial orientations, which are difficult to capture with the fixed receptive fields of conventional detection networks.
These three challenges are illustrated in Figure 2 through representative defect image examples.
Moreover, most existing segmentation models extract features solely in the spatial domain, limiting their ability to fully capture image texture information. As a result, these methods demonstrate suboptimal performance when segmenting blade casting defects in radiographic images, particularly in addressing the aforementioned three challenges.
To address the aforementioned challenges, we propose SFCF-Net, a Spatial-Frequency Complementary Fusion Network for blade casting defect segmentation. Our framework leverages the Discrete Cosine Transform (DCT), a widely used frequency-domain transformation method that serves as an important complement to spatial-domain information in image processing tasks [15,16,17,18,19,20], to establish complementary relationships between the spatial and frequency modalities for comprehensive texture characterization. The network architecture comprises three key components working synergistically. The Selective Cross-modal Calibration (SCC) module performs DCT-based frequency transformation and employs memory-guided gating mechanisms to selectively filter noise and misaligned information across modalities, effectively mitigating quality degradation in radiographic images. The Cross-modal Refinement and Complementation (CRC) module establishes feature correspondences through intra-modal refinement and inter-modal complementation, achieving robust discrimination amid intraclass variance and interclass similarity. The Asymmetric Window Attention (AWA) module employs bidirectional rectangular windows with varying orientations to flexibly capture irregular defect structures, overcoming the limitations of the fixed square receptive fields in conventional approaches. Extensive validation on the ATBCD-Seg dataset and a public benchmark demonstrates superior performance over state-of-the-art methods. The main contributions of this work are summarized as follows:
- We design a novel SFCF-Net architecture that exploits the dual-domain complementarity between spatial and frequency representations to achieve accurate blade casting defect segmentation in radiographic inspection;
- The proposed SCC module employs memory-guided gated mechanisms to selectively calibrate cross-modal features by filtering noise and misaligned information, effectively suppressing imaging artifacts while preserving defect-relevant characteristics in poor imaging quality scenarios;
- The proposed CRC module employs intra-modal refinement and inter-modal complementation to establish robust feature correspondences across modalities, mitigating intraclass variance while enhancing interclass separability;
- The proposed AWA module employs bidirectional rectangular windows to capture complete defect morphologies with diverse aspect ratios, enabling precise characterization of irregular defect structures.
The remainder of our paper is structured as follows: Section 2 presents the related works; Section 3 describes the details of the proposed SFCF-Net architecture; Section 4 presents the implementation process and experimental analysis; finally, Section 5 concludes the paper and suggests directions for future work.
3. Proposed Method
The overall framework of the proposed SFCF-Net is illustrated in Figure 3a. It adopts a hierarchical encoder–decoder architecture designed to exploit the complementarity between the spatial and frequency domains for accurate blade casting defect segmentation.
Encoder: We employ the Swin Transformer (STAM) [48] as the encoder backbone to extract multi-scale spatial-domain features. The encoder consists of four hierarchical stages, generating features F_i (i = 1, 2, 3, 4) at progressively decreasing resolutions and providing rich multi-scale representations for subsequent processing.
Frequency-guided decoder: The designed decoder comprises four progressive stages that perform hierarchical feature refinement and spatial-frequency synergistic learning. As shown in Figure 3b, in the decoder, we first transform the spatial features F_i from the encoder into the frequency domain, then recalibrate the feature representations using the SCC module, which exploits the complementarity of the spatial and frequency domains, to obtain the recalibrated features F_i^s and F_i^f. Subsequently, the CRC module computes spatial-frequency feature affinities to obtain an aggregated cross-modal feature F_i^c. To capture global contextual dependencies of irregular defect geometries, the AWA module applies shape-adaptive rectangular window attention to characterize defect structures, yielding F_i^a. Finally, to facilitate effective cross-modal feature interaction across different hierarchical levels and enable comprehensive information exchange between cross-modal patterns and global spatial context, we fuse F_i^c and F_i^a, combining channel concatenation and element-wise subtraction (⊖) followed by the ReLU activation function and Layer Normalization (LN), to obtain the predictive feature F_i^p, which facilitates the generation of more accurate segmentation results.
To optimize the proposed SFCF-Net, a supervised learning strategy is applied to the defect maps across different hierarchical levels. Specifically, we compute the cross-entropy loss L_ce between the predicted semantic map P_i and the ground truth map G at each of the four stages. The total loss function L_total is obtained by accumulating the cross-entropy losses over all stages, which can be formulated as follows:

L_total = Σ_{i=1}^{4} L_ce(P_i, G)
This multi-level supervision strategy enables the network to learn defect representations at different semantic scales, facilitating more effective feature learning.
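The multi-level supervision described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation; the helper names (`cross_entropy`, `total_loss`) are our own, and in practice each stage prediction would be upsampled to the ground-truth resolution before the loss is computed.

```python
import numpy as np

def cross_entropy(probs, target, eps=1e-12):
    """Mean pixel-wise cross-entropy between predicted class probabilities
    of shape (H, W, K) and integer ground-truth labels of shape (H, W)."""
    h, w = target.shape
    # pick the predicted probability of the true class at every pixel
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    return float(-np.log(picked + eps).mean())

def total_loss(stage_probs, target):
    """Deep supervision: accumulate the cross-entropy of every decoder stage."""
    return sum(cross_entropy(p, target) for p in stage_probs)
```

Because the losses are summed rather than averaged, every stage contributes with equal weight, which is the simplest reading of "accumulating the cross-entropy losses".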
3.1. Selective Cross-Modal Calibration
To address the challenge of poor image quality in blade casting defect segmentation, we propose the Selective Cross-modal Calibration (SCC) module. This module consists of two sequential stages: frequency transformation (Section 3.1.1) and cross-modal calibration (Section 3.1.2). The frequency transform stage converts spatial-domain features into the frequency domain using the DCT, while the cross-modal calibration stage employs memory-guided gated mechanisms to selectively filter noise and misaligned information, adaptively recalibrating the feature representations to preserve fine-grained defect details under poor imaging conditions.
3.1.1. Frequency Transform
The process of frequency transformation is illustrated in Figure 4. This module first converts spatial-domain features into the frequency domain using the DCT, which can be expressed as follows:

F(u, v) = c(u) c(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos[(2x + 1)uπ / 2N] cos[(2y + 1)vπ / 2N]

where F(u, v) represents the spectrogram with frequency components u and v in the horizontal and vertical directions, f(x, y) is the value at spatial position (x, y) with 0 ≤ x, y < N, and N is the image block size. Moreover, c(·) is the compensation coefficient, taking the value √(1/N) when the frequency index is 0 and √(2/N) otherwise.
In this study, to retain more detailed texture information of the radiographic images, the input feature maps F ∈ ℝ^(C×H×W) are partitioned into patches X_p ∈ ℝ^(n×C×M×M) (where n is the total number of patches, C is the channel dimension, and M × M is the patch size), with each patch treated as the image block in Equation (3). Next, the DCT is performed at the patch level to generate the frequency-domain features F^f. This process is expressed as follows:

F^f = Fold(DCT_j(Unfold(F)))

where Fold(·) and Unfold(·) represent the fold and unfold operations, respectively, whereas j denotes the frequency coefficient.
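As a concrete illustration of the patch-level transform, the following NumPy sketch builds the orthonormal DCT-II basis from the cosine formula above, unfolds a feature map into non-overlapping patches via reshaping, and applies the 2-D DCT to every patch. The patch size M = 8 and the reshape-based fold/unfold realization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II basis: C[u, x] = c(u) * cos((2x + 1) u pi / 2N),
    with compensation coefficient c(u) = sqrt(1/N) if u == 0 else sqrt(2/N)."""
    u = np.arange(N)[:, None]
    x = np.arange(N)[None, :]
    C = np.cos((2 * x + 1) * u * np.pi / (2 * N))
    C *= np.where(u == 0, np.sqrt(1 / N), np.sqrt(2 / N))
    return C

def patch_dct(feat, M=8):
    """Unfold a (C, H, W) feature map into non-overlapping M x M patches,
    apply the 2-D DCT per patch, and fold the spectra back in place."""
    Cmat = dct_matrix(M)
    c, H, W = feat.shape
    assert H % M == 0 and W % M == 0
    patches = feat.reshape(c, H // M, M, W // M, M)
    # 2-D DCT per patch: row transform followed by column transform
    spectra = np.einsum('um,chmwn,vn->chuwv', Cmat, patches, Cmat)
    return spectra.reshape(c, H, W)  # fold the spectra back to (C, H, W)
```

Since the basis is orthonormal, the transform preserves the energy of each patch, so no texture information is lost in the change of representation.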
3.1.2. Cross-Modal Calibration
Although the frequency-domain features obtained by the frequency transform can supplement the spatial-domain features, the two modal features still contain a substantial amount of redundant and noisy information due to the low image quality, which may interfere with the cross-modal interaction signals and ultimately diminish the impact of cross-modal synergy on the final segmentation results. To address this problem, we calibrate the cross-modal features through a memory-guided gated mechanism.
As shown in Figure 5, we reassign and reactivate the modal-specific representations (F^s and F^f) jointly generated by the encoder and the frequency transform. To enable the model to fully explore the low-quality image textures from the modal-specific information, the SCC module incorporates a memory vector and a forget gate to selectively filter out noise and misaligned information. Specifically, F^s and F^f are first transformed with a 1 × 1 convolutional layer and a linear layer, which project the multi-modal features into a lower-dimensional latent space, obtaining F̃^s and F̃^f, where Linear(·) and Conv_{1×1}(·) denote the linear layer and the 1 × 1 convolutional layer, respectively. Then F̃^s and F̃^f are fed into the similarity function Sim(·) to calculate the cosine similarity matrix, which represents the similarity score map between the modalities. From this interaction, the spatial-oriented remember vector r^s is generated to capture essential cross-modal information by re-weighting the spatial features based on their correlation with the frequency domain, where ⊗ represents matrix multiplication and σ_s denotes the SoftMax operation. To regulate the influx of noise and mismatched information, a forget vector g^s is constructed by applying a sigmoid activation function σ to a linear transformation of r^s. This forget vector acts as a gate to selectively filter out redundant counterparts that might interfere with crucial cross-modal signals. Finally, the recalibrated spatial-domain feature F̂^s is formed by the gated fusion of r^s and F^s, which prevents the forget gate from excessively filtering out valuable information.
Similarly, the recalibrated frequency-domain features can be derived through a symmetric procedure, where the frequency-domain features are refined under the guidance of spatial correlations, ensuring a robust mutual enhancement between the two domains.
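One plausible NumPy realization of this calibration for the spatial branch is sketched below. Since the paper's equations are not reproduced here, the exact placement of the projections, the gate, and the residual path are our assumptions; the weight matrices (`W_proj_s`, `W_proj_f`, `W_forget`) are hypothetical stand-ins for the learned layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def calibrate_spatial(Fs, Ff, W_proj_s, W_proj_f, W_forget):
    """Hypothetical SCC calibration for the spatial branch.
    Fs, Ff: (N, C) token features of the spatial / frequency modality."""
    Ps = Fs @ W_proj_s                     # project into a latent space
    Pf = Ff @ W_proj_f
    Ps /= np.linalg.norm(Ps, axis=1, keepdims=True) + 1e-8
    Pf /= np.linalg.norm(Pf, axis=1, keepdims=True) + 1e-8
    sim = Ps @ Pf.T                        # cosine similarity matrix
    # remember vector: re-weight spatial tokens by frequency correlation
    remember = softmax(sim, axis=-1) @ Fs
    # forget gate: sigmoid-activated linear transform of the remember vector
    forget = sigmoid(remember @ W_forget)
    # gated fusion with a residual path, so valuable information is kept
    return forget * remember + Fs
```

The residual term mirrors the stated design intent that the forget gate should not excessively filter out valuable information; swapping the two modalities yields the frequency-branch calibration.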
3.2. Cross-Modal Refinement and Complementation Module
Blade casting defects in radiographic images exhibit intraclass variance and interclass similarity, complicating the aggregation of features with diverse appearances within the same category and the discrimination between visually similar categories. Some researchers have tried to address this challenge in industrial defect detection scenarios [49,50,51]. However, these approaches demonstrate limitations in blade casting defect segmentation. First, radiographic image degradation limits the effectiveness of spatial-domain feature extraction and discriminative modeling. Second, existing methods operate exclusively within spatial representations, overlooking complementary information in other modalities.
To address these challenges, the CRC module is designed, as illustrated in Figure 6. The CRC employs a two-stage design. In the first stage, the Intra-modal Refinement (IaR) layer establishes intra-modal feature dependencies to mitigate intraclass variance by strengthening semantic consistency among features of the same defect category. In the second stage, the Inter-modal Complementation (IeC) layer establishes cross-modal correspondences between the spatial and frequency domains, exploiting their complementary discriminative strengths to enhance interclass separability. Specifically, linear mappings are first performed on the inputs F^s and F^f to generate the query, key, and value of the IaR layer, where W_q^s, W_k^s, W_v^s and W_q^f, W_k^f, W_v^f are the mapping matrices of the two modalities and WP(·) represents the window partition layer. Then, the semantic feature correlations within each modality are modeled within windows, obtaining the refined features A^s and A^f:

Attn(Q, K, V) = SoftMax(QKᵀ/√d + B)V

where d is the dimension of the query and key vectors in the attention layer and B denotes the relative position bias. Similarly, the query, key, and value of the IeC layer are obtained by linear mapping. After that, we compute the cross-modal attention to exploit the complementary information between the spatial and frequency domains, where the resulting features A^{s→f} and A^{f→s} represent the information interaction across the modalities. Finally, we use a convolutional fusion layer, together with the window reverse layer WR(·), to fuse A^{s→f} and A^{f→s}, with the fused features denoted as F^c.
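A simplified, single-window, single-head sketch of the inter-modal complementation step is given below. Sharing one set of projection matrices across modalities and omitting the relative position bias are simplifications of ours; the function names are illustrative.

```python
import numpy as np

def attention(Q, K, V, B=None):
    """SoftMax(Q K^T / sqrt(d) + B) V, as used in the IaR/IeC layers."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    if B is not None:
        logits = logits + B                     # relative position bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def iec_layer(Fs, Ff, Wq, Wk, Wv):
    """Inter-modal complementation (sketch): each modality queries the other."""
    s2f = attention(Fs @ Wq, Ff @ Wk, Ff @ Wv)  # spatial attends to frequency
    f2s = attention(Ff @ Wq, Fs @ Wk, Fs @ Wv)  # frequency attends to spatial
    return s2f, f2s
```

The IaR layer is the same operation with queries, keys, and values drawn from a single modality, so both stages reduce to the standard windowed attention formula above.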
3.3. Asymmetric Window Attention
Blade casting defects exhibit highly irregular geometric characteristics with arbitrary shapes, diverse aspect ratios, and unpredictable spatial orientations. Traditional convolutional methods [52,53,54,55] with fixed square receptive fields struggle to capture the complete morphological structures of such defects, particularly for defects with random aspect ratios or varying sizes. To address this challenge, we propose the AWA module, which employs bidirectional rectangular windows to flexibly capture irregular defect geometries and expand the receptive fields along different orientations.
The architecture of the AWA module is illustrated in Figure 7. Let X_i denote the input feature at the i-th decoder stage. The AWA module partitions the feature map into two types of non-overlapping rectangular windows processed in parallel through separate attention branches: horizontal windows (H-W) of size m × n, where m < n, designed to capture horizontally elongated defect patterns, and vertical windows (V-W) of size m × n, where m > n, designed to model vertically elongated defect structures.
For the horizontal window branch, the query Q_h, key K_h, and value V_h are computed through linear projections; similarly, the vertical window branch computes Q_v, K_v, and V_v. Within each window, the attention operation is performed, yielding A_h and A_v, the outputs of the horizontal and vertical window attention branches, in which the squared ReLU activation function replaces the Softmax. Unlike the Softmax function, which redistributes attention weights across all spatial positions, the squared ReLU acts as a hard filter to suppress low-relevance regions. After the attention computation, the features are projected through the linear projection layer Proj(·) and reversed back to the original spatial arrangement. The outputs from both branches are then concatenated to produce the enhanced feature, denoted X̂_i.
To accommodate different feature resolutions across decoder stages, the window sizes are progressively adjusted following a stage-specific configuration. Specifically, for the high-resolution stages (stages 1–2), larger window sizes are employed: 8 × 64 for horizontal windows and 64 × 8 for vertical windows at stage 1, and 4 × 32 for horizontal windows and 32 × 4 for vertical windows at stage 2. Conversely, for the low-resolution stages (stages 3–4), smaller windows are utilized: 2 × 16 and 16 × 2 at stage 3, and 1 × 8 and 8 × 1 at stage 4, maintaining appropriate receptive field coverage relative to the feature resolution. Unlike square windows, which restrict the attention area uniformly, bidirectional rectangular windows expand the receptive field to capture more textures of the defect shape along specific orientations. Through this design, the AWA module effectively captures complete defect morphologies with diverse aspect ratios and spatial orientations, enabling precise characterization of irregular defect structures.
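The rectangular window partition and its inverse (the window reverse step) can be sketched with plain reshapes; this is a minimal illustration on a single-channel map, with the horizontal branch using m < n (e.g. 8 × 64) and the vertical branch m > n (e.g. 64 × 8).

```python
import numpy as np

def partition_windows(feat, m, n):
    """Split a (H, W) map into non-overlapping m x n rectangular windows,
    returning an array of shape (num_windows, m, n)."""
    H, W = feat.shape
    assert H % m == 0 and W % n == 0
    return (feat.reshape(H // m, m, W // n, n)
                .transpose(0, 2, 1, 3)     # group window indices together
                .reshape(-1, m, n))

def reverse_windows(wins, H, W, m, n):
    """Undo partition_windows (the 'window reverse' layer)."""
    return (wins.reshape(H // m, W // n, m, n)
                .transpose(0, 2, 1, 3)
                .reshape(H, W))
```

Attention is then computed independently inside each m × n window, so a horizontally elongated defect falls inside a single 8 × 64 window rather than being split across several 22 × 22 square ones.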
4. Experiments and Results
To validate the effectiveness of our proposed approach, we conduct comprehensive experiments on two datasets: ATBCD-Seg and NEU-Seg. This section begins with an overview of the datasets, evaluation metrics, and implementation details, followed by detailed experimental results and analysis.
4.1. Datasets Description
ATBCD-Seg dataset: The ATBCD-Seg dataset was constructed by digitizing original aero-engine turbine blade photographs, acquired through traditional film photography, using an industrial digital film scanner, as shown in Figure 1. The dataset contains 1200 individual turbine blade defect images and comprises two common casting defect categories observed in blade manufacturing: Slag Inclusion (SI) and Redundancy (Re), with a balanced class distribution ratio of 1:1. Defects were meticulously labeled under the guidance of experts from turbine blade manufacturers. The dataset was split into training, validation, and test sets using a 7:2:1 ratio at the original image level. To increase data diversity, each blade defect image was then cropped nine times at a resolution of 224 × 224, placing the defect at the center and at eight surrounding positions. This preprocessing resulted in 7560 training images, 2160 validation images, and 1080 test images. Figure 8 presents the image data processing pipeline and representative defect images from the ATBCD-Seg dataset; all nine crops from each original image remain within the same split to prevent data leakage.
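The nine-crop procedure can be sketched as follows. The displacement between neighbouring crops is not specified in the text, so the half-crop offset used here is an assumption, as are the function and parameter names.

```python
import numpy as np

def nine_crops(image, cy, cx, size=224):
    """Crop a defect-centred patch plus eight surrounding patches.
    (cy, cx) is the defect centre; the offset between neighbouring
    crops (here size // 2) is an illustrative assumption."""
    H, W = image.shape[:2]
    half = size // 2
    crops = []
    for dy in (-half, 0, half):
        for dx in (-half, 0, half):
            # clamp the top-left corner so every crop stays inside the image
            y = int(np.clip(cy + dy - half, 0, H - size))
            x = int(np.clip(cx + dx - half, 0, W - size))
            crops.append(image[y:y + size, x:x + size])
    return crops
```

Clamping at the borders keeps all nine crops at the full 224 × 224 resolution even when the defect lies near the image edge.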
NEU-Seg dataset: To further evaluate the generalization of the proposed method, SFCF-Net was tested on the NEU-Seg dataset [56], representing additional industrial inspection scenarios. The dataset contains 300 images (200 × 200 resolution) for each of the three typical surface defects in hot-rolled steel strips: Patches (Pa), Inclusions (In), and Scratches (Sc). The dataset was divided into training, validation, and test sets with a 7:2:1 ratio. Figure 9 presents some example images from the NEU-Seg dataset; the problems of intraclass variance, interclass similarity, and irregular defect geometries also exist in this dataset.
4.2. Evaluation Metrics
To evaluate the proposed method along with other state-of-the-art approaches, we compute the Intersection-over-Union (IoU), Accuracy (Acc), Dice coefficient, Precision, and Recall for each defect type, and subsequently calculate the overall mean Intersection-over-Union (mIoU), mean Accuracy (mAcc), mean Dice (mDice), mean Precision (mPrecision), and mean Recall (mRecall) by averaging the per-class values. The per-class metrics are defined as follows:

IoU = TP/(TP + FP + FN), Dice = 2TP/(2TP + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN)

where TP, FP, and FN represent the True Positives, False Positives, and False Negatives of a given class, and the mean metrics average the per-class values over the total number of classes K (including the background).
Furthermore, the Foreground-Background IoU (FBIoU), which ignores class information, is employed to assess the model's ability to distinguish defect regions from the normal background. Here, the foreground is formed by the union of all defect class pixels (Re and SI), while the background represents the non-defect regions. This can be expressed as follows:

FBIoU = (IoU_F + IoU_B)/2

where IoU_F and IoU_B denote the foreground IoU and background IoU, respectively.
4.3. Implementation Details
In this study, experiments were conducted on a workstation equipped with an Intel Xeon Gold 6226R CPU @ 2.90 GHz, 128 GB Samsung DDR4 RAM, and an NVIDIA GeForce RTX 3090 GPU. The software environment consisted of Python 3.8.5, PyTorch 1.13.1 with CUDA 11.7, and Torchvision 0.14.1. For training, we employed the AdamW optimizer with an initial learning rate and weight decay of 1 × 10⁻⁴. A warmup strategy was applied for the first 200 iterations, followed by cosine annealing decay. The batch size was set to eight, and the input size was set to 224 × 224 pixels. The model was validated after each epoch. Training was conducted for a maximum of 500 epochs, with early stopping applied when the validation Dice score showed no improvement for 50 consecutive epochs. Moreover, we designed two models with different parameter quantities: SFCF-Net-S, adopting the small version of STAM, and SFCF-Net-B, using the base version of STAM; both models were pre-trained on ImageNet.
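The warmup-plus-cosine schedule can be written as a single function of the iteration index; the linear warmup shape and the final learning rate of zero are assumptions, since only the warmup length (200 iterations) and the annealing type are stated.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_steps=200, min_lr=0.0):
    """Linear warmup for the first `warmup_steps` iterations,
    then cosine annealing from base_lr down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

In PyTorch this behaviour is usually obtained by passing such a function to `torch.optim.lr_scheduler.LambdaLR` or by chaining a warmup scheduler with `CosineAnnealingLR`.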
4.4. Comparative Experiments on ATBCD-Seg and NEU-Seg Datasets
4.4.1. Quantitative Comparison
To verify the effectiveness of the proposed method, we compared SFCF-Net with a variety of state-of-the-art segmentation models, including CNN-based methods (FCN [13], U-Net [14], Deeplabv3 [57], Deeplabv3+ [58]), transformer-based methods (Segnext [59], Segformer [60]), and hybrid architectures (Upernet [61], Psanet [62], Pspnet [63], Gcnet [64], Icnet [65], Stdc [66]). Additionally, we compare against the recent defect-specific methods CSEPNet [67], CDARNet [9], LGGFormer [68], SPDP-Net [3], REA-Net [33], and CPDNet [69], as well as the general-purpose Segment Anything Model (SAM) [70]. For SAM, we employed the pretrained ViT-B as the image encoder for fine-tuning. The training prompts were derived from the samples: box prompts consist of the top three minimal bounding boxes covering the defect masks, while point prompts are five points randomly selected within each mask. To ensure a fair comparison, all baseline methods were trained with the same training schedule as the proposed SFCF-Net, including the input image size (224 × 224), pretrained backbone initialization, and other training configurations.
The quantitative results on the ATBCD-Seg dataset are presented in Table 1. It can be observed that SFCF-Net-B achieves the best performance across all evaluation metrics, reaching 88.39% mIoU and 93.59% mAcc, while our lightweight variant SFCF-Net-S also demonstrates competitive performance with 87.14% mIoU and 92.61% mAcc. Compared with classic methods, SFCF-Net-B outperforms FCN by 22.91% in mIoU and U-Net by 20.14%, demonstrating the effectiveness of spatial-frequency feature representation for capturing defect textures under poor imaging conditions. Among advanced methods, Deeplabv3 and Deeplabv3+ achieve mIoU of 81.80% and 82.13%, respectively, while SFCF-Net-B surpasses them by 6.59% and 6.26%. Transformer-based methods such as Segnext (79.02%) and Segformer (80.59%) show inferior performance, suggesting that self-attention alone is insufficient without frequency-domain information. Notably, compared with SAM, SFCF-Net-B achieves a 2.97% improvement in mIoU (from 85.42% to 88.39%) and 2.24% in mAcc (from 91.35% to 93.59%). The improvement is particularly significant for the individual defect categories, where SFCF-Net-B surpasses SAM by 5.60% for the SI defect (from 79.19% to 84.79%) and 3.26% for the Re defect (from 77.34% to 80.60%). The class-wise performance reveals that SI defects generally achieve higher IoU than Re defects across all methods. For instance, FCN obtains 65.67% for SI but only 31.92% for Re. In contrast, SFCF-Net-B achieves more balanced performance with 84.79% and 80.60%, respectively, demonstrating a much smaller performance gap (4.19%) than FCN (33.75%). This validates the effectiveness of our CRC module in handling intraclass variance and interclass similarity.
Moreover, Table 2 presents the results on the NEU-Seg dataset. SFCF-Net-B achieves the best performance with 86.87% mIoU and 97.92% mAcc, outperforming SAM by 0.84% and Segnext by 0.46%. When analyzing the class-wise IoU, SFCF-Net-B demonstrates more balanced performance across defect types (86.54% for the Pa defect, 83.45% for the Sc defect, and 79.61% for the In defect) than other methods. These results validate the robust generalization capability of our approach across diverse industrial scenarios. The mIoU gap between SFCF-Net-B and SAM is larger on ATBCD-Seg (2.97%) than on NEU-Seg (0.84%), indicating that spatial-frequency representations are more effective under degraded radiographic imaging conditions, which validates our design rationale of incorporating frequency-domain information for blade casting defect segmentation.
4.4.2. Qualitative Comparison
To evaluate the effectiveness of the proposed SFCF-Net, we visualize the segmentation results on the ATBCD-Seg and NEU-Seg datasets in Figure 10 and Figure 11, respectively. For clarity, we analyze representative cases that highlight the distinctive challenges in defect segmentation and demonstrate how our approach addresses them.
Qualitative comparisons on the ATBCD-Seg dataset are shown in Figure 10. Each column represents a defect type, and each row, from top to bottom, shows the original image, the Ground Truth (GT), and the results of the compared methods. In the first column, several methods fail to accurately locate the boundaries of the Re defect, resulting in incomplete or excessive segmentation. This boundary ambiguity is common in radiographic images, where defect edges gradually fade into the background. In contrast, our method accurately captures the defect boundaries by leveraging the SCC module, which preserves high-frequency edge information through frequency-domain features. In the second column, uneven brightness and low contrast hinder the complete segmentation of the defective region. For example, Pspnet produces incomplete Re defect segmentation, while U-Net and Deeplabv3+ erroneously split a single defect into multiple segments. These errors demonstrate their vulnerability to intensity variations and can lead to incorrect defect counting in practical applications. In contrast, our method incorporates the frequency domain to capture additional defect textures, reducing segmentation errors. In the third and fourth columns, the SI defects exhibit heterogeneous intra-class patterns, further complicating segmentation. For example, Deeplabv3 misclassifies a portion of the SI defect as Re in the third column, and Icnet misclassifies the background as Re in the fourth column. These errors indicate insufficient discriminative capability between visually similar defect categories. By leveraging frequency information and modeling the relationship between spatial- and frequency-domain features through our CRC module, our method accurately distinguishes defect-free and defect regions, achieving more precise segmentation.
Qualitative comparisons on the NEU-Seg dataset are presented in Figure 11. In the first row, our method demonstrates superior capability in describing fine defect patterns and intricate details through the AWA module, which leverages rectangular window attention to capture geometric dependencies. In the second row, many competing methods fail to completely cover the actual defect area, whereas our method maintains segmentation completeness by capturing fuzzy boundary features through the SCC module. Meanwhile, as illustrated in the third row, other methods are prone to background misclassification. In contrast, our method effectively suppresses background interference through cross-modal feature refinement and complementation in the CRC module, enabling accurate distinction between defect and non-defect regions. These visual comparisons complement the quantitative results and demonstrate the practical effectiveness of our method in real industrial inspection scenarios.
4.5. Ablation Study
To assess the effectiveness of individual components and the overall framework, we conduct a comprehensive ablation study on the ATBCD-Seg dataset.
4.5.1. Effectiveness of Different Modules
To verify the effectiveness of each designed module, we conduct a series of ablation experiments on the ATBCD-Seg dataset. The baseline model is constructed by removing the SCC, CRC, and AWA modules while retaining the rest of the network architecture. We then progressively incorporate the proposed modules to evaluate their individual and collective contributions to segmentation performance. The ablation results are presented in Table 3.
As shown in Table 3, incorporating the SCC module into the baseline improves performance substantially, with mIoU increasing by 4.66% to 81.84%. This validates the effectiveness of introducing frequency-domain features and performing progressive cross-modal recalibration, which enhances defect boundary localization under poor imaging conditions. Combining the SCC and CRC modules further boosts performance to 83.89% mIoU, a 2.05% improvement over SCC alone. This indicates that establishing cross-modal feature correspondences through intra- and inter-modal alignment significantly enhances the model’s capability to distinguish subtle defect variations while maintaining robustness to appearance inconsistencies. Similarly, combining SCC with AWA achieves 83.13% mIoU, demonstrating the effectiveness of dilated attention in capturing irregular defect geometries. Combining the CRC and AWA modules without frequency-domain features yields 82.47% mIoU, lower than any SCC-based configuration. This result underscores the critical role of frequency-domain information in blade casting defect segmentation, without which the model struggles to fully characterize defect textures, and validates our design rationale that frequency-domain features are essential for addressing poor image quality in radiographic inspection. The best performance is achieved by the full SFCF-Net, which integrates all three proposed modules, reaching 88.39% mIoU, 92.34% mRecall, 93.41% mDice, 93.22% mPrecision, and 90.27% FBIoU. Compared to the baseline, the complete framework improves mIoU by 11.21%, mRecall by 2.53%, mDice by 8.10%, mPrecision by 11.71%, and FBIoU by 10.22%. These results demonstrate the effectiveness of our method in capturing detailed defect features under challenging visual conditions. To provide a more intuitive representation of the ablation study results, Figure 12 illustrates the same performance metrics using a radial bar chart visualization, where the baseline with SCC is denoted as Model 1, the baseline with SCC and CRC as Model 2, the baseline with SCC and AWA as Model 3, and the baseline with CRC and AWA as Model 4.
4.5.2. Effectiveness of Frequency-Domain Features in SCC Module
To investigate the effectiveness of the frequency-domain features introduced in the SCC module and the contribution of different frequency components to defect feature learning, we conduct comparative experiments. Our method leverages the complete frequency spectrum, including both low-frequency and high-frequency information, to comprehensively capture defect texture features in blade casting radiographic images. We compare three configurations: using only low-frequency components from the first half of the frequency spectrum (denoted as Low), using only high-frequency components from the second half of the spectrum (denoted as High), and leveraging all frequency components across the complete spectrum (denoted as Full).
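The intuition behind these three configurations can be illustrated with a simple Fourier-domain split. The sketch below is ours, not the authors' implementation: it uses a radial cutoff (an assumption) to separate an image into complementary low- and high-frequency reconstructions, whose sum recovers the original image.

```python
import numpy as np

def split_frequency_bands(img: np.ndarray, cutoff: float = 0.5):
    """Split a 2-D image into low- and high-frequency reconstructions.

    `cutoff` is the fraction of the centered spectrum's radius treated as
    "low frequency"; the complement is "high frequency". Illustrative only.
    """
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))  # move DC to the center

    # Radial mask: True inside the low-frequency disk around the center.
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    radius = np.hypot(yy - cy, xx - cx)
    low_mask = radius <= cutoff * min(h, w) / 2

    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# The two bands are complementary: low + high reconstructs the input.
img = np.random.rand(64, 64)
low, high = split_frequency_bands(img)
assert np.allclose(low + high, img)
```

In this picture, the Low configuration keeps only the smooth reconstruction (coarse structure, slow intensity variations), the High configuration keeps only the residual (edges and fine texture), and Full uses both.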
As shown in Table 4, the model employing all frequency information achieves optimal performance across all evaluation metrics (mIoU: 88.39%, mAcc: 93.59%, mPre: 93.22%, FBIoU: 90.27%), which demonstrates the effectiveness of integrating both low and high-frequency information. Notably, the configuration utilizing only low-frequency information exhibits the poorest performance, with substantial degradation of approximately 6.24% in mIoU compared to the full spectrum approach. This observation can be explained by the fact that low-frequency components primarily encode coarse structural information and smooth intensity variations, which are insufficient for precisely delineating defect boundaries and capturing subtle texture anomalies present in radiographic images. In contrast, the high-frequency configuration achieves significantly better results compared to the low-frequency variant, as high-frequency components encapsulate critical boundary information and fine-grained textural patterns that are essential for accurate defect localization. However, relying exclusively on high-frequency information also leads to suboptimal results, with a performance gap of 1.66% in mIoU compared to the full spectrum, suggesting that low-frequency components provide necessary contextual cues that are crucial for robust segmentation.
To further demonstrate the importance of high-frequency information and the complementary nature of different frequency bands, we visualize the heat maps, as illustrated in Figure 13. From left to right, we represent the original image, GT, and the heat maps of different frequency components. The low-frequency configuration produces diffuse activation patterns with imprecise spatial localization, while the high-frequency variant shows improved edge sensitivity but generates scattered and discontinuous activation regions with incomplete defect coverage. In contrast, by utilizing the complete frequency spectrum, our model effectively focuses on irregular defect morphologies while suppressing background interference. The full-spectrum approach successfully integrates fine-grained boundary details from high-frequency components with global structural context from low-frequency components, enabling the network to generate spatially coherent predictions that accurately capture complex defect geometries in radiographic images.
4.5.3. Effectiveness of Rectangular Window Design in AWA Module
The AWA module employs bidirectional rectangular windows to expand receptive fields along different orientations for capturing irregular defect geometries. To validate this design, we compare five window configurations. Square windows employ a uniform 8 × 8 size across all stages. Horizontal windows (H-W) use progressively decreasing sizes of 8 × 64, 4 × 32, 2 × 16, and 1 × 8 for stages 1–4, while Vertical windows (V-W) use sizes of 64 × 8, 32 × 4, 16 × 2, and 8 × 1 for the corresponding stages. Fixed-Bidirectional (F-B) maintains constant sizes of 8 × 64 and 64 × 8 across all stages, whereas Progressive-Bidirectional (P-B) combines both orientations at each stage with sizes (8 × 64, 64 × 8), (4 × 32, 32 × 4), (2 × 16, 16 × 2), and (1 × 8, 8 × 1) for stages 1–4, respectively. The ablation results are presented in Table 5.
As shown in Table 5, Square windows achieve the poorest performance (mIoU: 85.73%), as uniform receptive fields fail to capture elongated defect structures. Horizontal (86.72% mIoU) and Vertical (86.21% mIoU) configurations show improvements but exhibit asymmetric performance: Horizontal achieves 79.25% IoU on Re defects while Vertical obtains 82.89% on SI defects, indicating orientation-specific limitations. This validates that single-orientation windows cannot comprehensively handle defects with diverse aspect ratios. F-B demonstrates the benefit of parallel processing (87.45% mIoU, 83.67% IoU for Re), but fixed large windows lead to suboptimal feature extraction across different resolutions. In contrast, P-B achieves optimal performance (88.39% mIoU, 93.59% mAcc, 90.27% FBIoU), improving mIoU by 0.94% over F-B, with more balanced performance across defect types (Re: 80.60%, SI: 84.79%). The progressive window size reduction enables scale-adaptive feature extraction: larger windows in early stages capture global context, while smaller windows in later stages preserve fine-grained details, enabling robust characterization of irregular defect geometries.
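The progressive window schedule described above can be made concrete with a small partition helper. This is an illustrative sketch under our own assumptions (the `partition_windows` helper and the token layout are ours, not the paper's code): it splits a feature map into non-overlapping rectangular windows of the P-B sizes, the layout window attention would then operate on.

```python
import numpy as np

# Progressive-Bidirectional (P-B) window sizes per stage, as listed in the
# paper: each stage pairs a horizontal and a vertical rectangular window.
PB_WINDOWS = {
    1: [(8, 64), (64, 8)],
    2: [(4, 32), (32, 4)],
    3: [(2, 16), (16, 2)],
    4: [(1, 8), (8, 1)],
}

def partition_windows(feat: np.ndarray, win_h: int, win_w: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping win_h x win_w windows.

    Returns (num_windows, win_h * win_w, C), a token layout of the kind
    typically fed to window attention. H and W must divide evenly.
    """
    H, W, C = feat.shape
    assert H % win_h == 0 and W % win_w == 0, "window must tile the feature map"
    feat = feat.reshape(H // win_h, win_h, W // win_w, win_w, C)
    feat = feat.transpose(0, 2, 1, 3, 4)  # gather each window's rows together
    return feat.reshape(-1, win_h * win_w, C)

# Stage-1 horizontal window on a 64 x 64 feature map with 16 channels:
feat = np.zeros((64, 64, 16))
tokens = partition_windows(feat, *PB_WINDOWS[1][0])  # 8 x 64 window
assert tokens.shape == (8, 512, 16)  # 8 windows of 8*64 = 512 tokens each
```

A horizontal window covers the full 64-pixel width in one attention operation, which is why a single stage can relate pixels across an elongated defect, while its vertical counterpart does the same along the other axis.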
To provide a more intuitive representation of these results, Figure 14 visualizes the performance metrics, demonstrating the superior balance achieved by our P-B design across different evaluation dimensions.
4.5.4. Effectiveness of Feature Combination Strategies
To establish effective connections between cross-modal and global information, we extract two complementary feature representations: the cross-modal feature from the CRC module and the global feature from the AWA module, as described in Section 3. These features facilitate comprehensive information exchange between local cross-modal patterns and global spatial context. Based on these representations, we explore three fusion strategies for combining these heterogeneous features. As illustrated in Figure 15, the first strategy employs parallel fusion, denoted as P, where both feature types are processed simultaneously and concatenated along the channel dimension. The other two strategies use sequential fusion with different processing orders: one processes the cross-modal features first and then incorporates the global features via element-wise addition, while the other processes the global features first and subsequently integrates the cross-modal features via element-wise addition. As reported in Table 6, the parallel fusion configuration consistently outperforms both sequential alternatives, suggesting that simultaneous processing of cross-modal and global information is more effective than sequential integration. Accordingly, we adopt the parallel fusion strategy throughout our experimental framework.
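The three strategies differ only in how the two feature tensors are merged. A minimal sketch (our own; the shapes and the identity `process` placeholder standing in for the real refinement sub-network are illustrative assumptions):

```python
import numpy as np

def fuse_parallel(f_cm: np.ndarray, f_g: np.ndarray) -> np.ndarray:
    """Parallel fusion (P): handle both features side by side and
    concatenate them along the channel dimension."""
    return np.concatenate([f_cm, f_g], axis=-1)

def fuse_sequential(first: np.ndarray, second: np.ndarray,
                    process=lambda x: x) -> np.ndarray:
    """Sequential fusion: refine one feature first (`process` is a
    placeholder for the refinement block), then add the other element-wise.
    Swapping the arguments gives the two sequential orderings."""
    return process(first) + second

f_cm = np.random.rand(8, 8, 32)  # cross-modal feature from the CRC module
f_g = np.random.rand(8, 8, 32)   # global feature from the AWA module

assert fuse_parallel(f_cm, f_g).shape == (8, 8, 64)    # channels doubled
assert fuse_sequential(f_cm, f_g).shape == (8, 8, 32)  # channels preserved
```

One practical consequence visible even in this sketch: parallel fusion keeps both feature sets intact in separate channels for a later projection to weigh, whereas element-wise addition commits to a fixed merge before any learned weighting.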
4.5.5. Effectiveness of Different Backbone Networks
To investigate the influence of backbone architecture on model performance, we compare several CNN-based and transformer-based backbones. As shown in Table 7, CNN-based backbones achieve less favorable performance than transformer-based backbones (Pyramid Vision Transformer (PVT) [71] and STAM). This difference can be attributed to the transformer’s ability to capture long-range dependencies and global context, which is critical for blade defect segmentation under challenging radiographic imaging conditions. In contrast, CNN-based backbones, relying on local receptive fields, are prone to segmentation errors.
However, transformer-based backbones introduce considerable computational complexity due to their self-attention mechanisms for capturing long-range dependencies, resulting in reduced inference efficiency.
Table 7 presents a detailed analysis of different backbone architectures, comparing their computational requirements in terms of parameters, FLOPs, and inference time. Although ResNet-101 and Swin-S have similar parameter counts (83.1 M and 84.7 M, respectively), ResNet-101 achieves faster inference at 175.2 ms per image compared to 177.1 ms for Swin-S. The computational burden is also reflected in FLOPs, with Swin-S requiring 61.7 G operations against ResNet-101’s 60.8 G, highlighting the cost of self-attention mechanisms.
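Per-image inference times such as those reported here are usually averaged over repeated forward passes after a few warm-up iterations, so that one-off costs (allocation, caching, JIT) do not skew the figure. A generic timing harness in that style might look like the following; this is our sketch of common practice, not the measurement protocol used in the paper.

```python
import time

def benchmark_ms(fn, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Average wall-clock time of `fn` in milliseconds.

    Warm-up calls are discarded so initialization costs do not
    inflate the reported per-call latency.
    """
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Usage: wrap a single forward pass, e.g.
# latency = benchmark_ms(lambda: model(sample_input))
```

For GPU models, an explicit device synchronization before reading the clock is also needed, since kernel launches return asynchronously.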
It should be emphasized that blade casting radiographic inspection is typically conducted as an offline quality control process after manufacturing completion, rather than a real-time detection task during production. In this industrial context, each blade undergoes a comprehensive post-manufacturing inspection where detection accuracy is paramount. Missed defects or false negatives can lead to catastrophic consequences in aircraft engine operation, making segmentation precision the primary concern over inference speed. The additional computational cost of transformer-based architectures is therefore acceptable and justified, as the inspection workflow permits sufficient processing time while the accuracy gains directly contribute to enhanced safety assurance. Despite these computational trade-offs, Swin-S and Swin-B strike a favorable balance between accuracy and efficiency. While they require marginally more processing time, their performance remains adequate for blade casting defect detection tasks, where segmentation accuracy is prioritized over speed.
5. Conclusions
This study introduces a novel SFCF-Net framework for automated blade casting defect segmentation in industrial radiographic inspection, addressing three critical challenges. The proposed framework integrates three key technical innovations: (1) The SCC module effectively mitigates poor image quality through a memory-guided gated mechanism that selectively filters noise and misaligned information, enabling robust delineation of defect boundaries under degraded imaging conditions. (2) The CRC module tackles defect discrimination challenges arising from intraclass variance and interclass similarity through a dual-stage attention mechanism that models intra-modal and inter-modal dependencies. (3) The AWA module captures irregular defect geometries via bidirectional rectangular windows that expand receptive fields along different orientations, addressing the constraints of conventional square window attention.
Comprehensive experiments on the ATBCD-Seg and NEU-Seg datasets demonstrate that SFCF-Net exhibits promising performance across multiple evaluation metrics, consistently outperforming existing state-of-the-art methods and meeting real-world needs for automated quality control in blade manufacturing.
Future work will focus on several directions to further advance practical applicability. (1) Developing weakly supervised and semi-supervised learning approaches to reduce the dependency on extensive pixel-level annotations, which are extremely time-consuming and labor-intensive to produce for radiographic inspection. (2) Exploring zero-shot and open-set learning methodologies to enable the detection of previously unseen defect categories without requiring additional labeled training data, thereby enhancing the model’s adaptability to emerging defect patterns in production environments. (3) Investigating large foundation models to leverage their powerful representation capabilities for improved generalization across diverse casting conditions and defect manifestations. Through these efforts, we aim to further advance aircraft engine blade defect detection technology, ensuring aviation safety and the efficient operation of equipment.