1. Introduction
The casting quality of turbine blades is critical for ensuring the safe operation of aircraft engines under extreme operational conditions. Consequently, pre-service defect inspection is crucial for maintaining engine integrity and flight safety. Radiographic inspection, as shown in Figure 1, is widely employed to detect potential defects within casting blades owing to its non-invasive nature. However, the current inspection process depends on expert knowledge and experience, leading to inefficient and subjective assessments. This underscores the urgent need for automated defect detection in radiography images of casting blades.
Recently, numerous researchers have employed deep learning-based methods for turbine blade inspection [1,2,3,4,5] and other industrial defect detection fields [6,7,8,9]. For instance, Liu et al. [1] proposed an attention mechanism-enhanced feature fusion network based on You Only Look Once (YOLO) for wind turbine blade surface defect detection, which employs a bidirectional feature pyramid and a classification loss function with an attenuation factor to address sample imbalance. Qi et al. [2] developed a semi-supervised framework that integrates three parallel self-supervised mechanisms with a vision transformer backbone for aero-engine blade defect segmentation, which employs patch-level distortions and a class-centric loss to mitigate class imbalances with limited labeled samples. To address the challenges of tiny defects and weak features in aero-engine blade inspection, Ma et al. [3] introduced SPDP-Net, which leverages semantic prior mining to capture pixel-level defect location priors and employs a defect enhancement perception module to separate weak defects from complex backgrounds. Sun et al. [4] introduced the Surface Defect Detection-Detection Transformer (SDD-DETR) by applying the detection transformer architecture to aero-engine blade surface defect detection, which employs a multi-scale deformable attention module and a lightweight feed-forward network to reduce computational complexity while maintaining high detection accuracy. Wang et al. [5] employed a pyramid vision transformer with spatial-cross overlapping embedding and dual attention mechanisms to capture fuzzy boundary features and model long-range spatial interactions for casting defect detection in radiographic images with low visibility. Beyond turbine blade inspection, deep learning techniques have also been extensively applied to other industrial defect detection domains. Liu et al. [6] developed a real-time anchor-free detector with global and local feature enhancement modules to address complex backgrounds and small-size defects, while incorporating a box refinement module to capture multi-scale defect shapes. Ma et al. [7] introduced the Efficient Linear Attention (ELA) model based on YOLOv8, which incorporates linear attention mechanisms and a selective feature pyramid network to balance detection accuracy and computational efficiency for steel surface defect detection under complex industrial conditions. Yang et al. [8] proposed a lightweight detection model for internal tunnel lining defects in ground-penetrating radar images, which employs a dual attention module to achieve efficient defect recognition with reduced model size. For metal surface defect segmentation, Li et al. [9] proposed the Cross-Dimensional Adaptive Region reconstruction Network (CDARNet) with a cross-dimensional adaptive region reconstruction module to suppress background noise and a contextual self-correlation semantic unification module to enhance defect localization across multiple scales.
Generally, these approaches can be categorized into two primary network architectures: detection networks and segmentation networks. Detection networks, such as the Region Convolutional Neural Network (R-CNN) series [10], the YOLO series [11], and the DETR series [12], generate bounding boxes for defect localization and classification. In contrast, segmentation networks, including the Fully Convolutional Network (FCN) series [13] and the U-Net series [14], perform pixel-wise classification of defect regions, enabling precise delineation of defect location and shape. Thus, segmentation networks are increasingly utilized for industrial defect inspection tasks. Despite these advances, automated defect segmentation in blade casting faces three major challenges that significantly affect segmentation accuracy:
- (1) Poor image quality inherent in industrial radiography, characterized by low signal-to-noise ratios, uneven illumination, and imaging artifacts, obscures defect boundaries and compromises reliable feature extraction.
- (2) Defects exhibit both intraclass variance and interclass similarity: the same defect type may appear vastly different under varying casting conditions, while distinct defect categories may share similar textural characteristics in challenging imaging scenarios.
- (3) Irregular defect geometries introduce complex morphological variations, including arbitrary shapes, non-uniform size distributions, and unpredictable spatial orientations, which are difficult to capture with the fixed receptive fields of conventional detection networks.
These three challenges are illustrated in Figure 2 through representative defect image examples.
Moreover, most existing segmentation models extract features solely in the spatial domain, limiting their ability to fully capture image texture information. As a result, these methods demonstrate suboptimal performance when segmenting blade casting defects in radiographic images, particularly in addressing the aforementioned three challenges.
To address the aforementioned challenges, we propose SFCF-Net, a Spatial-Frequency Complementary Fusion Network for blade casting defect segmentation. Our framework leverages the Discrete Cosine Transform (DCT), a widely used frequency-domain transformation method that serves as an important complement to spatial-domain information in image processing tasks [15,16,17,18,19,20], to establish complementary relationships between the spatial and frequency modalities for comprehensive texture characterization. The network architecture comprises three key components working synergistically. The Selective Cross-modal Calibration (SCC) module performs DCT-based frequency transformation and employs memory-guided gating mechanisms to selectively filter noise and misaligned information across modalities, effectively mitigating quality degradation in radiographic images. The Cross-modal Refinement and Complementation (CRC) module establishes feature correspondences through intra-modal refinement and inter-modal complementation, achieving robust discrimination amid intraclass variance and interclass similarity. The Asymmetric Window Attention (AWA) module employs bidirectional rectangular windows with varying orientations to flexibly capture irregular defect structures, overcoming the limitations of the fixed square receptive fields in conventional approaches. Extensive validation on the ATBCD-Seg dataset and a public benchmark demonstrates superior performance over state-of-the-art methods. The main contributions of this work are summarized as follows:
- We design a novel SFCF-Net architecture that exploits the dual-domain complementarity between spatial and frequency representations to achieve accurate blade casting defect segmentation in radiographic inspection;
- The proposed SCC module employs memory-guided gated mechanisms to selectively calibrate cross-modal features by filtering noise and misaligned information, effectively suppressing imaging artifacts while preserving defect-relevant characteristics in poor imaging quality scenarios;
- The proposed CRC module employs intra-modal refinement and inter-modal complementation to establish robust feature correspondences across modalities, mitigating intraclass variance while enhancing interclass separability;
- The proposed AWA module employs bidirectional rectangular windows to capture complete defect morphologies with diverse aspect ratios, enabling precise characterization of irregular defect structures.
The remainder of our paper is structured as follows: Section 2 presents the related works; Section 3 describes the details of the proposed SFCF-Net architecture; Section 4 presents the implementation process and experimental analysis; finally, Section 5 concludes the paper and suggests directions for future work.
3. Proposed Method
The overall framework of the proposed SFCF-Net is illustrated in Figure 3a. It adopts a hierarchical encoder–decoder architecture designed to exploit the complementarity between the spatial and frequency domains for accurate blade casting defect segmentation.
Encoder: We employ the Swin Transformer (STAM) [48] as the encoder backbone to extract multi-scale spatial-domain features. The encoder consists of four hierarchical stages, generating features F_i (i = 1, 2, 3, 4) at progressively decreasing resolutions and providing rich multi-scale representations for subsequent processing.
Frequency-guided decoder: The designed decoder comprises four progressive stages that perform hierarchical feature refinement and spatial-frequency synergistic learning. As shown in Figure 3b, in the decoder, we first transform the spatial features F_i from the encoder into the frequency domain, then recalibrate the feature representations using the SCC module, which exploits the complementarity of the spatial and frequency domains, to obtain the recalibrated features F_i^s and F_i^f. Subsequently, the CRC module computes spatial-frequency feature affinities to obtain an aggregated cross-modal feature F_i^c. To capture global contextual dependencies of irregular defect geometries, the AWA module applies shape-adaptive rectangular window attention to characterize defect structures, yielding F_i^a. Finally, to facilitate effective cross-modal feature interaction across different hierarchical levels and enable comprehensive information exchange between cross-modal patterns and global spatial context, we fuse F_i^c and F_i^a, combining channel concatenation and element-wise subtraction (⊖) followed by the ReLU activation function and Layer Normalization (LN), to obtain the predictive feature F_i^p, which facilitates the generation of more accurate segmentation results.
To optimize the proposed SFCF-Net, a supervised learning strategy is applied to the defect maps across different hierarchical levels. Specifically, we compute the cross-entropy loss L_ce between the predicted semantic map P_i and the ground truth map G at each of the four stages. The total loss function L_total is obtained by accumulating the cross-entropy losses over all stages, which can be formulated as follows:

L_total = Σ_{i=1}^{4} L_ce(P_i, G)
This multi-level supervision strategy enables the network to learn defect representations at different semantic scales, facilitating more effective feature learning.
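The multi-level supervision described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation; the helper names (`cross_entropy`, `total_loss`) are our own, and in practice each stage prediction would be upsampled to the ground-truth resolution before the loss is computed.

```python
import numpy as np

def cross_entropy(probs, target, eps=1e-12):
    """Mean pixel-wise cross-entropy between predicted class probabilities
    of shape (H, W, K) and integer ground-truth labels of shape (H, W)."""
    h, w = target.shape
    # pick the predicted probability of the true class at every pixel
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    return float(-np.log(picked + eps).mean())

def total_loss(stage_probs, target):
    """Deep supervision: accumulate the cross-entropy of every decoder stage."""
    return sum(cross_entropy(p, target) for p in stage_probs)
```

Because the losses are summed rather than averaged, every stage contributes with equal weight, which is the simplest reading of "accumulating the cross-entropy losses".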
3.1. Selective Cross-Modal Calibration
To address the challenge of poor image quality in blade casting defect segmentation, we propose the Selective Cross-modal Calibration (SCC) module. This module consists of two sequential stages: frequency transformation (Section 3.1.1) and cross-modal calibration (Section 3.1.2). The frequency transform stage converts spatial-domain features into the frequency domain using the DCT, while the cross-modal calibration stage employs memory-guided gated mechanisms to selectively filter noise and misaligned information, adaptively recalibrating the feature representations to preserve fine-grained defect details under poor imaging conditions.
3.1.1. Frequency Transform
The process of frequency transformation is illustrated in Figure 4. This module first converts spatial-domain features into the frequency domain using the DCT, which can be expressed as follows:

F(u, v) = c(u) c(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos[(2x + 1)uπ / 2N] cos[(2y + 1)vπ / 2N]

where F(u, v) represents the spectrogram with frequency components u and v in the horizontal and vertical directions, f(x, y) is the value at spatial position (x, y) with 0 ≤ x, y < N, and N is the image block size. Moreover, c(·) is the compensation coefficient, taking the value √(1/N) when the frequency index is 0 and √(2/N) otherwise.
In this study, to retain more detailed texture information of the radiographic images, the input feature maps F ∈ ℝ^(C×H×W) are partitioned into patches X_p ∈ ℝ^(n×C×M×M) (where n is the total number of patches, C is the channel dimension, and M × M is the patch size), with each patch treated as the image block in Equation (3). Next, the DCT is performed at the patch level to generate the frequency-domain features F^f. This process is expressed as follows:

F^f = Fold(DCT_j(Unfold(F)))

where Fold(·) and Unfold(·) represent the fold and unfold operations, respectively, whereas j denotes the frequency coefficient.
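As a concrete illustration of the patch-level transform, the following NumPy sketch builds the orthonormal DCT-II basis from the cosine formula above, unfolds a feature map into non-overlapping patches via reshaping, and applies the 2-D DCT to every patch. The patch size M = 8 and the reshape-based fold/unfold realization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II basis: C[u, x] = c(u) * cos((2x + 1) u pi / 2N),
    with compensation coefficient c(u) = sqrt(1/N) if u == 0 else sqrt(2/N)."""
    u = np.arange(N)[:, None]
    x = np.arange(N)[None, :]
    C = np.cos((2 * x + 1) * u * np.pi / (2 * N))
    C *= np.where(u == 0, np.sqrt(1 / N), np.sqrt(2 / N))
    return C

def patch_dct(feat, M=8):
    """Unfold a (C, H, W) feature map into non-overlapping M x M patches,
    apply the 2-D DCT per patch, and fold the spectra back in place."""
    Cmat = dct_matrix(M)
    c, H, W = feat.shape
    assert H % M == 0 and W % M == 0
    patches = feat.reshape(c, H // M, M, W // M, M)
    # 2-D DCT per patch: row transform followed by column transform
    spectra = np.einsum('um,chmwn,vn->chuwv', Cmat, patches, Cmat)
    return spectra.reshape(c, H, W)  # fold the spectra back to (C, H, W)
```

Since the basis is orthonormal, the transform preserves the energy of each patch, so no texture information is lost in the change of representation.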
3.1.2. Cross-Modal Calibration
Although the frequency-domain features obtained by the frequency transform can supplement the spatial-domain features, the two modal features still contain a substantial amount of redundant and noisy information due to the low image quality, which may interfere with the cross-modal interaction signals and ultimately diminish the impact of cross-modal synergy on the final segmentation results. To address this problem, we calibrate the cross-modal features through a memory-guided gated mechanism.
As shown in Figure 5, we reassign and reactivate the modal-specific representations (F^s and F^f) jointly generated by the encoder and the frequency transform. To enable the model to fully explore the low-quality image textures from the modal-specific information, the SCC module incorporates a memory vector and a forget gate to selectively filter out noise and misaligned information. Specifically, F^s and F^f are first transformed with a 1 × 1 convolutional layer and a linear layer, which project the multi-modal features into a lower-dimensional latent space, obtaining F̃^s and F̃^f, where Linear(·) and Conv_{1×1}(·) denote the linear layer and the 1 × 1 convolutional layer, respectively. Then F̃^s and F̃^f are fed into the similarity function Sim(·) to calculate the cosine similarity matrix, which represents the similarity score map between the modalities. From this interaction, the spatial-oriented remember vector r^s is generated to capture essential cross-modal information by re-weighting the spatial features based on their correlation with the frequency domain, where ⊗ represents matrix multiplication and σ_s denotes the SoftMax operation. To regulate the influx of noise and mismatched information, a forget vector g^s is constructed by applying a sigmoid activation function σ to a linear transformation of r^s. This forget vector acts as a gate to selectively filter out redundant counterparts that might interfere with crucial cross-modal signals. Finally, the recalibrated spatial-domain feature F̂^s is formed by the gated fusion of r^s and F^s, which prevents the forget gate from excessively filtering out valuable information.
Similarly, the recalibrated frequency-domain features can be derived through a symmetric procedure, where the frequency-domain features are refined under the guidance of spatial correlations, ensuring a robust mutual enhancement between the two domains.
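One plausible NumPy realization of this calibration for the spatial branch is sketched below. Since the paper's equations are not reproduced here, the exact placement of the projections, the gate, and the residual path are our assumptions; the weight matrices (`W_proj_s`, `W_proj_f`, `W_forget`) are hypothetical stand-ins for the learned layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def calibrate_spatial(Fs, Ff, W_proj_s, W_proj_f, W_forget):
    """Hypothetical SCC calibration for the spatial branch.
    Fs, Ff: (N, C) token features of the spatial / frequency modality."""
    Ps = Fs @ W_proj_s                     # project into a latent space
    Pf = Ff @ W_proj_f
    Ps /= np.linalg.norm(Ps, axis=1, keepdims=True) + 1e-8
    Pf /= np.linalg.norm(Pf, axis=1, keepdims=True) + 1e-8
    sim = Ps @ Pf.T                        # cosine similarity matrix
    # remember vector: re-weight spatial tokens by frequency correlation
    remember = softmax(sim, axis=-1) @ Fs
    # forget gate: sigmoid-activated linear transform of the remember vector
    forget = sigmoid(remember @ W_forget)
    # gated fusion with a residual path, so valuable information is kept
    return forget * remember + Fs
```

The residual term mirrors the stated design intent that the forget gate should not excessively filter out valuable information; swapping the two modalities yields the frequency-branch calibration.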
3.2. Cross-Modal Refinement and Complementation Module
Blade casting defects in radiographic images exhibit intraclass variance and interclass similarity, complicating the aggregation of features with diverse appearances within the same category and the discrimination between visually similar categories. Some researchers have tried to address this challenge in industrial defect detection scenarios [49,50,51]. However, these approaches demonstrate limitations in blade casting defect segmentation. First, radiographic image degradation limits the effectiveness of spatial-domain feature extraction and discriminative modeling. Second, existing methods operate exclusively within spatial representations, overlooking complementary information in other modalities.
To address these challenges, the CRC module is designed, as illustrated in Figure 6. The CRC employs a two-stage design. In the first stage, the Intra-modal Refinement (IaR) layer establishes intra-modal feature dependencies to mitigate intraclass variance by strengthening semantic consistency among features of the same defect category. In the second stage, the Inter-modal Complementation (IeC) layer establishes cross-modal correspondences between the spatial and frequency domains, exploiting their complementary discriminative strengths to enhance interclass separability. Specifically, linear mappings are first performed on the inputs F^s and F^f to generate the query, key, and value of the IaR layer, where W_q^s, W_k^s, W_v^s and W_q^f, W_k^f, W_v^f are the mapping matrices of the two modalities and WP(·) represents the window partition layer. Then, the semantic feature correlations within each modality are modeled within windows, obtaining the refined features A^s and A^f:

Attn(Q, K, V) = SoftMax(QKᵀ/√d + B)V

where d is the dimension of the query and key vectors in the attention layer and B denotes the relative position bias. Similarly, the query, key, and value of the IeC layer are obtained by linear mapping. After that, we compute the cross-modal attention to exploit the complementary information between the spatial and frequency domains, where the resulting features A^{s→f} and A^{f→s} represent the information interaction across the modalities. Finally, we use a convolutional fusion layer, together with the window reverse layer WR(·), to fuse A^{s→f} and A^{f→s}, with the fused features denoted as F^c.
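A simplified, single-window, single-head sketch of the inter-modal complementation step is given below. Sharing one set of projection matrices across modalities and omitting the relative position bias are simplifications of ours; the function names are illustrative.

```python
import numpy as np

def attention(Q, K, V, B=None):
    """SoftMax(Q K^T / sqrt(d) + B) V, as used in the IaR/IeC layers."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    if B is not None:
        logits = logits + B                     # relative position bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def iec_layer(Fs, Ff, Wq, Wk, Wv):
    """Inter-modal complementation (sketch): each modality queries the other."""
    s2f = attention(Fs @ Wq, Ff @ Wk, Ff @ Wv)  # spatial attends to frequency
    f2s = attention(Ff @ Wq, Fs @ Wk, Fs @ Wv)  # frequency attends to spatial
    return s2f, f2s
```

The IaR layer is the same operation with queries, keys, and values drawn from a single modality, so both stages reduce to the standard windowed attention formula above.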
3.3. Asymmetric Window Attention
Blade casting defects exhibit highly irregular geometric characteristics with arbitrary shapes, diverse aspect ratios, and unpredictable spatial orientations. Traditional convolutional methods [52,53,54,55] with fixed square receptive fields struggle to capture the complete morphological structures of such defects, particularly for defects with random aspect ratios or varying sizes. To address this challenge, we propose the AWA module, which employs bidirectional rectangular windows to flexibly capture irregular defect geometries and expand the receptive fields along different orientations.
The architecture of the AWA module is illustrated in Figure 7. Let X_i denote the input feature at the i-th decoder stage. The AWA module partitions the feature map into two types of non-overlapping rectangular windows processed in parallel through separate attention branches: horizontal windows (H-W) of size m × n, where m < n, designed to capture horizontally elongated defect patterns, and vertical windows (V-W) of size m × n, where m > n, designed to model vertically elongated defect structures.
For the horizontal window branch, the query Q_h, key K_h, and value V_h are computed through linear projections; similarly, the vertical window branch computes Q_v, K_v, and V_v. Within each window, the attention operation is performed, yielding A_h and A_v, the outputs of the horizontal and vertical window attention branches, in which the squared ReLU activation function replaces the Softmax. Unlike the Softmax function, which redistributes attention weights across all spatial positions, the squared ReLU acts as a hard filter to suppress low-relevance regions. After the attention computation, the features are projected through the linear projection layer Proj(·) and reversed back to the original spatial arrangement. The outputs from both branches are then concatenated to produce the enhanced feature, denoted X̂_i.
To accommodate different feature resolutions across decoder stages, the window sizes are progressively adjusted following a stage-specific configuration. Specifically, for the high-resolution stages (stages 1–2), larger window sizes are employed: 8 × 64 for horizontal windows and 64 × 8 for vertical windows at stage 1, and 4 × 32 for horizontal windows and 32 × 4 for vertical windows at stage 2. Conversely, for the low-resolution stages (stages 3–4), smaller windows are utilized: 2 × 16 and 16 × 2 at stage 3, and 1 × 8 and 8 × 1 at stage 4, maintaining appropriate receptive field coverage relative to the feature resolution. Unlike square windows, which restrict the attention area uniformly, bidirectional rectangular windows expand the receptive field to capture more textures of the defect shape along specific orientations. Through this design, the AWA module effectively captures complete defect morphologies with diverse aspect ratios and spatial orientations, enabling precise characterization of irregular defect structures.
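The rectangular window partition and its inverse (the window reverse step) can be sketched with plain reshapes; this is a minimal illustration on a single-channel map, with the horizontal branch using m < n (e.g. 8 × 64) and the vertical branch m > n (e.g. 64 × 8).

```python
import numpy as np

def partition_windows(feat, m, n):
    """Split a (H, W) map into non-overlapping m x n rectangular windows,
    returning an array of shape (num_windows, m, n)."""
    H, W = feat.shape
    assert H % m == 0 and W % n == 0
    return (feat.reshape(H // m, m, W // n, n)
                .transpose(0, 2, 1, 3)     # group window indices together
                .reshape(-1, m, n))

def reverse_windows(wins, H, W, m, n):
    """Undo partition_windows (the 'window reverse' layer)."""
    return (wins.reshape(H // m, W // n, m, n)
                .transpose(0, 2, 1, 3)
                .reshape(H, W))
```

Attention is then computed independently inside each m × n window, so a horizontally elongated defect falls inside a single 8 × 64 window rather than being split across several 22 × 22 square ones.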
4. Experiments and Results
To validate the effectiveness of our proposed approach, we conduct comprehensive experiments on two datasets: ATBCD-Seg and NEU-Seg. This section begins with an overview of the datasets, evaluation metrics, and implementation details, followed by detailed experimental results and analysis.
4.1. Datasets Description
ATBCD-Seg dataset: The ATBCD-Seg dataset was constructed by digitizing original aero-engine turbine blade photographs, acquired through traditional film photography, using an industrial digital film scanner, as shown in Figure 1. The dataset contains 1200 individual turbine blade defect images and comprises two common casting defect categories observed in blade manufacturing: Slag Inclusion (SI) and Redundancy (Re), with a balanced class distribution ratio of 1:1. Defects were meticulously labeled under the guidance of experts from turbine blade manufacturers. The dataset was split into training, validation, and test sets using a 7:2:1 ratio at the original image level. To increase data diversity, each blade defect image was then cropped nine times at a resolution of 224 × 224, placing the defect at the center and at eight surrounding positions. This preprocessing resulted in 7560 training images, 2160 validation images, and 1080 test images. Figure 8 presents the image data processing pipeline and representative defect images from the ATBCD-Seg dataset; all nine crops from each original image remain within the same split to prevent data leakage.
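The nine-crop procedure can be sketched as follows. The displacement between neighbouring crops is not specified in the text, so the half-crop offset used here is an assumption, as are the function and parameter names.

```python
import numpy as np

def nine_crops(image, cy, cx, size=224):
    """Crop a defect-centred patch plus eight surrounding patches.
    (cy, cx) is the defect centre; the offset between neighbouring
    crops (here size // 2) is an illustrative assumption."""
    H, W = image.shape[:2]
    half = size // 2
    crops = []
    for dy in (-half, 0, half):
        for dx in (-half, 0, half):
            # clamp the top-left corner so every crop stays inside the image
            y = int(np.clip(cy + dy - half, 0, H - size))
            x = int(np.clip(cx + dx - half, 0, W - size))
            crops.append(image[y:y + size, x:x + size])
    return crops
```

Clamping at the borders keeps all nine crops at the full 224 × 224 resolution even when the defect lies near the image edge.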
NEU-Seg dataset: To further evaluate the generalization of the proposed method, SFCF-Net was tested on the NEU-Seg dataset [56], representing additional industrial inspection scenarios. The dataset contains 300 images (200 × 200 resolution) for each of the three typical surface defects in hot-rolled steel strips: Patches (Pa), Inclusions (In), and Scratches (Sc). The dataset was divided into training, validation, and test sets with a 7:2:1 ratio. Figure 9 presents some example images from the NEU-Seg dataset; the problems of intraclass variance, interclass similarity, and irregular defect geometries also exist in this dataset.
4.2. Evaluation Metrics
To evaluate the proposed method along with other state-of-the-art approaches, we compute the Intersection-over-Union (IoU), Accuracy (Acc), Dice coefficient, Precision, and Recall for each defect type, and subsequently calculate the overall mean Intersection-over-Union (mIoU), mean Accuracy (mAcc), mean Dice (mDice), mean Precision (mPrecision), and mean Recall (mRecall) by averaging the per-class values. The per-class metrics are defined as follows:

IoU = TP/(TP + FP + FN), Dice = 2TP/(2TP + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN)

where TP, FP, and FN represent the True Positives, False Positives, and False Negatives of a given class, and the mean metrics average the per-class values over the total number of classes K (including the background).
Furthermore, the Foreground-Background IoU (FBIoU), which ignores class information, is employed to assess the model's ability to distinguish defect regions from the normal background. Here, the foreground is formed by the union of all defect class pixels (Re and SI), while the background represents the non-defect regions. This can be expressed as follows:

FBIoU = (IoU_F + IoU_B)/2

where IoU_F and IoU_B denote the foreground IoU and background IoU, respectively.
4.3. Implementation Details
In this study, experiments were conducted on a workstation equipped with an Intel Xeon Gold 6226R CPU @ 2.90 GHz, 128 GB Samsung DDR4 RAM, and an NVIDIA GeForce RTX 3090 GPU. The software environment consisted of Python 3.8.5, PyTorch 1.13.1 with CUDA 11.7, and Torchvision 0.14.1. For training, we employed the AdamW optimizer with an initial learning rate and weight decay of 1 × 10⁻⁴. A warmup strategy was applied for the first 200 iterations, followed by cosine annealing decay. The batch size was set to eight, and the input size was set to 224 × 224 pixels. The model was validated after each epoch. Training was conducted for a maximum of 500 epochs, with early stopping applied when the validation Dice score showed no improvement for 50 consecutive epochs. Moreover, we designed two models with different parameter quantities: SFCF-Net-S, adopting the small version of STAM, and SFCF-Net-B, using the base version of STAM; both models were pre-trained on ImageNet.
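The warmup-plus-cosine schedule can be written as a single function of the iteration index; the linear warmup shape and the final learning rate of zero are assumptions, since only the warmup length (200 iterations) and the annealing type are stated.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_steps=200, min_lr=0.0):
    """Linear warmup for the first `warmup_steps` iterations,
    then cosine annealing from base_lr down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

In PyTorch this behaviour is usually obtained by passing such a function to `torch.optim.lr_scheduler.LambdaLR` or by chaining a warmup scheduler with `CosineAnnealingLR`.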
4.4. Comparative Experiments on ATBCD-Seg and NEU-Seg Datasets
4.4.1. Quantitative Comparison
To verify the effectiveness of the proposed method, we compared SFCF-Net with a variety of state-of-the-art segmentation models, including CNN-based methods (FCN [13], U-Net [14], Deeplabv3 [57], Deeplabv3+ [58]), transformer-based methods (Segnext [59], Segformer [60]), and hybrid architectures (Upernet [61], Psanet [62], Pspnet [63], Gcnet [64], Icnet [65], Stdc [66]). Additionally, we compare against the recent defect-specific methods CSEPNet [67], CDARNet [9], LGGFormer [68], SPDP-Net [3], REA-Net [33], and CPDNet [69], as well as the general-purpose Segment Anything Model (SAM) [70]. For SAM, we employed the pretrained ViT-B as the image encoder for fine-tuning. The training prompts were derived from the samples: box prompts consist of the top three minimal bounding boxes covering the defect masks, while point prompts are five points randomly selected within each mask. To ensure a fair comparison, all baseline methods were trained with the same training schedule as the proposed SFCF-Net, including the input image size (224 × 224), pretrained backbone initialization, and other training configurations.
The quantitative results on the ATBCD-Seg dataset are presented in Table 1. It can be observed that SFCF-Net-B achieves the best performance across all evaluation metrics, reaching 88.39% mIoU and 93.59% mAcc, while our lightweight variant SFCF-Net-S also demonstrates competitive performance with 87.14% mIoU and 92.61% mAcc. Compared with classic methods, SFCF-Net-B outperforms FCN by 22.91% in mIoU and U-Net by 20.14%, demonstrating the effectiveness of spatial-frequency feature representation for capturing defect textures under poor imaging conditions. Among advanced methods, Deeplabv3 and Deeplabv3+ achieve mIoU of 81.80% and 82.13%, respectively, while SFCF-Net-B surpasses them by 6.59% and 6.26%. Transformer-based methods such as Segnext (79.02%) and Segformer (80.59%) show inferior performance, suggesting that self-attention alone is insufficient without frequency-domain information. Notably, compared with SAM, SFCF-Net-B achieves a 2.97% improvement in mIoU (from 85.42% to 88.39%) and 2.24% in mAcc (from 91.35% to 93.59%). The improvement is particularly significant for the individual defect categories, where SFCF-Net-B surpasses SAM by 5.60% for the SI defect (from 79.19% to 84.79%) and 3.26% for the Re defect (from 77.34% to 80.60%). The class-wise performance reveals that SI defects generally achieve higher IoU than Re defects across all methods. For instance, FCN obtains 65.67% for SI but only 31.92% for Re. In contrast, SFCF-Net-B achieves more balanced performance with 84.79% and 80.60%, respectively, demonstrating a much smaller performance gap (4.19%) than FCN (33.75%). This validates the effectiveness of our CRC module in handling intraclass variance and interclass similarity.
Moreover, Table 2 presents the results on the NEU-Seg dataset. SFCF-Net-B achieves the best performance with 86.87% mIoU and 97.92% mAcc, outperforming SAM by 0.84% and Segnext by 0.46%. When analyzing the class-wise IoU, SFCF-Net-B demonstrates more balanced performance across defect types (86.54% for the Pa defect, 83.45% for the Sc defect, and 79.61% for the In defect) than other methods. These results validate the robust generalization capability of our approach across diverse industrial scenarios. The mIoU gap between SFCF-Net-B and SAM is larger on ATBCD-Seg (2.97%) than on NEU-Seg (0.84%), indicating that spatial-frequency representations are more effective under degraded radiographic imaging conditions, which validates our design rationale of incorporating frequency-domain information for blade casting defect segmentation.
4.4.2. Qualitative Comparison
To evaluate the effectiveness of the proposed SFCF-Net, we visualize the segmentation results on the ATBCD-Seg and NEU-Seg datasets in Figure 10 and Figure 11, respectively. For clarity, we analyze representative cases that highlight the distinctive challenges in defect segmentation and demonstrate how our approach addresses them.
Qualitative comparisons on the ATBCD-Seg dataset are shown in Figure 10. Each column represents a defect type, and each row, from top to bottom, shows the original image, the Ground Truth (GT), and the results of the compared methods. In the first column, several methods fail to accurately locate the boundaries of the Re defect, resulting in incomplete or excessive segmentation. This boundary ambiguity is common in radiographic images, where defect edges gradually fade into the background. In contrast, our method accurately captures the defect boundaries by leveraging the SCC module, which preserves high-frequency edge information through frequency-domain features. In the second column, uneven brightness and low contrast hinder the complete segmentation of the defective region. For example, Pspnet produces incomplete Re defect segmentation, while U-Net and Deeplabv3+ erroneously split a single defect into multiple segments. These errors demonstrate their vulnerability to intensity variations and can lead to incorrect defect counting in practical applications. In contrast, our method incorporates the frequency domain to capture additional defect textures, reducing segmentation errors. In the third and fourth columns, the SI defects exhibit heterogeneous intra-class patterns, further complicating segmentation. For example, Deeplabv3 misclassifies a portion of the SI defect as Re in the third column, and Icnet misclassifies the background as Re in the fourth column. These errors indicate insufficient discriminative capability between visually similar defect categories. By leveraging frequency information and modeling the relationship between spatial- and frequency-domain features through our CRC module, our method accurately distinguishes defect-free and defect regions, achieving more precise segmentation.
Qualitative comparisons on the NEU-Seg dataset are presented in Figure 11. In the first row, our method demonstrates superior capability in describing fine defect patterns and intricate details through the AWA module, which leverages rectangular window attention to capture geometric dependencies. In the second row, many competing methods fail to completely cover the actual defect area, whereas our method maintains segmentation completeness by capturing fuzzy boundary features through the SCC module. Meanwhile, as illustrated in the third row, other methods are prone to background misclassification. In contrast, our method effectively suppresses background interference through cross-modal feature refinement and complementation in the CRC module, enabling accurate distinction between defect and non-defect regions. These visual comparisons complement the quantitative results and demonstrate the practical effectiveness of our method in real industrial inspection scenarios.
4.5. Ablation Study
To assess the effectiveness of individual components and the overall framework, we conduct a comprehensive ablation study on the ATBCD-Seg dataset.
4.5.1. Effectiveness of Different Modules
To verify the effectiveness of each designed module, we conduct a series of ablation experiments on the ATBCD-Seg dataset. The baseline model is constructed by removing the SCC, CRC, and AWA modules while retaining the rest of the network architecture. We then progressively incorporate the proposed modules to evaluate their individual and collective contributions to segmentation performance. The ablation results are presented in Table 3.
As shown in Table 3, incorporating the SCC module into the baseline improves performance substantially, with mIoU increasing by 4.66% to 81.84%. This validates the effectiveness of introducing frequency-domain features and performing progressive cross-modal recalibration, which enhances defect boundary localization under poor imaging conditions. Combining the SCC and CRC modules further boosts performance to 83.89% mIoU, a 2.05% improvement over SCC alone. This indicates that establishing cross-modal feature correspondences through intra- and inter-modal alignment significantly enhances the model’s capability to distinguish subtle defect variations while maintaining robustness to appearance inconsistencies. Similarly, combining SCC with AWA achieves 83.13% mIoU, demonstrating the effectiveness of dilated attention in capturing irregular defect geometries. Combining the CRC and AWA modules without frequency-domain features yields 82.47% mIoU, lower than any SCC-based configuration. This result underscores the critical role of frequency-domain information in blade casting defect segmentation, without which the model struggles to fully characterize defect textures, and validates our design rationale that frequency-domain features are essential for addressing poor image quality in radiographic inspection. The best performance is achieved by the full SFCF-Net, which integrates all three proposed modules, reaching 88.39% mIoU, 92.34% mRecall, 93.41% mDice, 93.22% mPrecision, and 90.27% FBIoU. Compared to the baseline, the complete framework improves mIoU by 11.21%, mRecall by 2.53%, mDice by 8.10%, mPrecision by 11.71%, and FBIoU by 10.22%. These results demonstrate the effectiveness of our method in capturing detailed defect features under challenging visual conditions. To provide a more intuitive representation of the ablation study results, Figure 12 illustrates the same performance metrics using a radial bar chart visualization, where the baseline with SCC is denoted as Model 1, the baseline with SCC and CRC as Model 2, the baseline with SCC and AWA as Model 3, and the baseline with CRC and AWA as Model 4.
4.5.2. Effectiveness of Frequency-Domain Features in SCC Module
To investigate the effectiveness of the frequency-domain features introduced in the SCC module and the contribution of different frequency components to defect feature learning, we conduct comparative experiments. Our method leverages the complete frequency spectrum, including both low-frequency and high-frequency information, to comprehensively capture defect texture features in blade casting radiographic images. We compare three configurations: using only low-frequency components from the first half of the frequency spectrum (denoted as Low), using only high-frequency components from the second half of the spectrum (denoted as High), and leveraging all frequency components across the complete spectrum (denoted as Full).
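The intuition behind these three configurations can be illustrated with a simple Fourier-domain split. The sketch below is ours, not the authors' implementation: it uses a radial cutoff (an assumption) to separate an image into complementary low- and high-frequency reconstructions, whose sum recovers the original image.

```python
import numpy as np

def split_frequency_bands(img: np.ndarray, cutoff: float = 0.5):
    """Split a 2-D image into low- and high-frequency reconstructions.

    `cutoff` is the fraction of the centered spectrum's radius treated as
    "low frequency"; the complement is "high frequency". Illustrative only.
    """
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))  # move DC to the center

    # Radial mask: True inside the low-frequency disk around the center.
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    radius = np.hypot(yy - cy, xx - cx)
    low_mask = radius <= cutoff * min(h, w) / 2

    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# The two bands are complementary: low + high reconstructs the input.
img = np.random.rand(64, 64)
low, high = split_frequency_bands(img)
assert np.allclose(low + high, img)
```

In this picture, the Low configuration keeps only the smooth reconstruction (coarse structure, slow intensity variations), the High configuration keeps only the residual (edges and fine texture), and Full uses both.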
As shown in Table 4, the model employing all frequency information achieves optimal performance across all evaluation metrics (mIoU: 88.39%, mAcc: 93.59%, mPre: 93.22%, FBIoU: 90.27%), which demonstrates the effectiveness of integrating both low and high-frequency information. Notably, the configuration utilizing only low-frequency information exhibits the poorest performance, with substantial degradation of approximately 6.24% in mIoU compared to the full spectrum approach. This observation can be explained by the fact that low-frequency components primarily encode coarse structural information and smooth intensity variations, which are insufficient for precisely delineating defect boundaries and capturing subtle texture anomalies present in radiographic images. In contrast, the high-frequency configuration achieves significantly better results compared to the low-frequency variant, as high-frequency components encapsulate critical boundary information and fine-grained textural patterns that are essential for accurate defect localization. However, relying exclusively on high-frequency information also leads to suboptimal results, with a performance gap of 1.66% in mIoU compared to the full spectrum, suggesting that low-frequency components provide necessary contextual cues that are crucial for robust segmentation.
To further demonstrate the importance of high-frequency information and the complementary nature of different frequency bands, we visualize the heat maps, as illustrated in Figure 13. From left to right, we represent the original image, GT, and the heat maps of different frequency components. The low-frequency configuration produces diffuse activation patterns with imprecise spatial localization, while the high-frequency variant shows improved edge sensitivity but generates scattered and discontinuous activation regions with incomplete defect coverage. In contrast, by utilizing the complete frequency spectrum, our model effectively focuses on irregular defect morphologies while suppressing background interference. The full-spectrum approach successfully integrates fine-grained boundary details from high-frequency components with global structural context from low-frequency components, enabling the network to generate spatially coherent predictions that accurately capture complex defect geometries in radiographic images.
4.5.3. Effectiveness of Rectangular Window Design in AWA Module
The AWA module employs bidirectional rectangular windows to expand receptive fields along different orientations for capturing irregular defect geometries. To validate this design, we compare five window configurations. Square windows employ a uniform 8 × 8 size across all stages. Horizontal windows (H-W) use progressively decreasing sizes of 8 × 64, 4 × 32, 2 × 16, and 1 × 8 for stages 1–4, while Vertical windows (V-W) use sizes of 64 × 8, 32 × 4, 16 × 2, and 8 × 1 for the corresponding stages. Fixed-Bidirectional (F-B) maintains constant sizes of 8 × 64 and 64 × 8 across all stages, whereas Progressive-Bidirectional (P-B) combines both orientations at each stage with sizes (8 × 64, 64 × 8), (4 × 32, 32 × 4), (2 × 16, 16 × 2), and (1 × 8, 8 × 1) for stages 1–4, respectively. The ablation results are presented in Table 5.
As shown in Table 5, Square windows achieve the poorest performance (mIoU: 85.73%), as uniform receptive fields fail to capture elongated defect structures. Horizontal (86.72% mIoU) and Vertical (86.21% mIoU) configurations show improvements but exhibit asymmetric performance: Horizontal achieves 79.25% IoU on Re defects while Vertical obtains 82.89% on SI defects, indicating orientation-specific limitations. This validates that single-orientation windows cannot comprehensively handle defects with diverse aspect ratios. F-B demonstrates the benefit of parallel processing (87.45% mIoU, 83.67% IoU for Re), but fixed large windows lead to suboptimal feature extraction across different resolutions. In contrast, P-B achieves optimal performance (88.39% mIoU, 93.59% mAcc, 90.27% FBIoU), improving mIoU by 0.94% over F-B, with more balanced performance across defect types (Re: 80.60%, SI: 84.79%). The progressive window size reduction enables scale-adaptive feature extraction: larger windows in early stages capture global context, while smaller windows in later stages preserve fine-grained details, enabling robust characterization of irregular defect geometries.
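The progressive window schedule described above can be made concrete with a small partition helper. This is an illustrative sketch under our own assumptions (the `partition_windows` helper and the token layout are ours, not the paper's code): it splits a feature map into non-overlapping rectangular windows of the P-B sizes, the layout window attention would then operate on.

```python
import numpy as np

# Progressive-Bidirectional (P-B) window sizes per stage, as listed in the
# paper: each stage pairs a horizontal and a vertical rectangular window.
PB_WINDOWS = {
    1: [(8, 64), (64, 8)],
    2: [(4, 32), (32, 4)],
    3: [(2, 16), (16, 2)],
    4: [(1, 8), (8, 1)],
}

def partition_windows(feat: np.ndarray, win_h: int, win_w: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping win_h x win_w windows.

    Returns (num_windows, win_h * win_w, C), a token layout of the kind
    typically fed to window attention. H and W must divide evenly.
    """
    H, W, C = feat.shape
    assert H % win_h == 0 and W % win_w == 0, "window must tile the feature map"
    feat = feat.reshape(H // win_h, win_h, W // win_w, win_w, C)
    feat = feat.transpose(0, 2, 1, 3, 4)  # gather each window's rows together
    return feat.reshape(-1, win_h * win_w, C)

# Stage-1 horizontal window on a 64 x 64 feature map with 16 channels:
feat = np.zeros((64, 64, 16))
tokens = partition_windows(feat, *PB_WINDOWS[1][0])  # 8 x 64 window
assert tokens.shape == (8, 512, 16)  # 8 windows of 8*64 = 512 tokens each
```

A horizontal window covers the full 64-pixel width in one attention operation, which is why a single stage can relate pixels across an elongated defect, while its vertical counterpart does the same along the other axis.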
To provide a more intuitive representation of these results, Figure 14 visualizes the performance metrics, demonstrating the superior balance achieved by our P-B design across different evaluation dimensions.
4.5.4. Effectiveness of Feature Combination Strategies
To establish effective connections between cross-modal and global information, we extract two complementary feature representations: the cross-modal feature from the CRC module and the global feature from the AWA module, as described in Section 3. These features facilitate comprehensive information exchange between local cross-modal patterns and global spatial context. Based on these representations, we explore three fusion strategies for combining these heterogeneous features. As illustrated in Figure 15, the first strategy employs parallel fusion, denoted as P, where both feature types are processed simultaneously and concatenated along the channel dimension. The other two strategies use sequential fusion with different processing orders: one processes the cross-modal features first and then incorporates the global features via element-wise addition, while the other processes the global features first and subsequently integrates the cross-modal features via element-wise addition. As reported in Table 6, the parallel fusion configuration consistently outperforms both sequential alternatives, suggesting that simultaneous processing of cross-modal and global information is more effective than sequential integration. Accordingly, we adopt the parallel fusion strategy throughout our experimental framework.
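The three strategies differ only in how the two feature tensors are merged. A minimal sketch (our own; the shapes and the identity `process` placeholder standing in for the real refinement sub-network are illustrative assumptions):

```python
import numpy as np

def fuse_parallel(f_cm: np.ndarray, f_g: np.ndarray) -> np.ndarray:
    """Parallel fusion (P): handle both features side by side and
    concatenate them along the channel dimension."""
    return np.concatenate([f_cm, f_g], axis=-1)

def fuse_sequential(first: np.ndarray, second: np.ndarray,
                    process=lambda x: x) -> np.ndarray:
    """Sequential fusion: refine one feature first (`process` is a
    placeholder for the refinement block), then add the other element-wise.
    Swapping the arguments gives the two sequential orderings."""
    return process(first) + second

f_cm = np.random.rand(8, 8, 32)  # cross-modal feature from the CRC module
f_g = np.random.rand(8, 8, 32)   # global feature from the AWA module

assert fuse_parallel(f_cm, f_g).shape == (8, 8, 64)    # channels doubled
assert fuse_sequential(f_cm, f_g).shape == (8, 8, 32)  # channels preserved
```

One practical consequence visible even in this sketch: parallel fusion keeps both feature sets intact in separate channels for a later projection to weigh, whereas element-wise addition commits to a fixed merge before any learned weighting.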
4.5.5. Effectiveness of Different Backbone Networks
To investigate the influence of backbone architecture on model performance, we compare several CNN-based and transformer-based backbones. As shown in Table 7, CNN-based backbones achieve less favorable performance than transformer-based backbones (Pyramid Vision Transformer (PVT) [71] and STAM). This difference can be attributed to the transformer’s ability to capture long-range dependencies and global context, which is critical for blade defect segmentation under challenging radiographic imaging conditions. In contrast, CNN-based backbones, relying on local receptive fields, are prone to segmentation errors.
However, transformer-based backbones introduce considerable computational complexity due to their self-attention mechanisms for capturing long-range dependencies, resulting in reduced inference efficiency.
Table 7 presents a detailed analysis of different backbone architectures, comparing their computational requirements in terms of parameters, FLOPs, and inference time. Although ResNet-101 and Swin-S have similar parameter counts (83.1 M and 84.7 M, respectively), ResNet-101 achieves faster inference at 175.2 ms per image compared to 177.1 ms for Swin-S. The computational burden is also reflected in FLOPs, with Swin-S requiring 61.7 G operations against ResNet-101’s 60.8 G, highlighting the cost of self-attention mechanisms.
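Per-image inference times such as those reported here are usually averaged over repeated forward passes after a few warm-up iterations, so that one-off costs (allocation, caching, JIT) do not skew the figure. A generic timing harness in that style might look like the following; this is our sketch of common practice, not the measurement protocol used in the paper.

```python
import time

def benchmark_ms(fn, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Average wall-clock time of `fn` in milliseconds.

    Warm-up calls are discarded so initialization costs do not
    inflate the reported per-call latency.
    """
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Usage: wrap a single forward pass, e.g.
# latency = benchmark_ms(lambda: model(sample_input))
```

For GPU models, an explicit device synchronization before reading the clock is also needed, since kernel launches return asynchronously.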
It should be emphasized that blade casting radiographic inspection is typically conducted as an offline quality control process after manufacturing completion, rather than a real-time detection task during production. In this industrial context, each blade undergoes a comprehensive post-manufacturing inspection where detection accuracy is paramount. Missed defects or false negatives can lead to catastrophic consequences in aircraft engine operation, making segmentation precision the primary concern over inference speed. The additional computational cost of transformer-based architectures is therefore acceptable and justified, as the inspection workflow permits sufficient processing time while the accuracy gains directly contribute to enhanced safety assurance. Despite these computational trade-offs, Swin-S and Swin-B strike a favorable balance between accuracy and efficiency. While they require marginally more processing time, their performance remains adequate for blade casting defect detection tasks, where segmentation accuracy is prioritized over speed.
5. Conclusions
This study introduces a novel SFCF-Net framework for automated blade casting defect segmentation in industrial radiographic inspection, addressing three critical challenges. The proposed framework integrates three key technical innovations: (1) The SCC module effectively mitigates poor image quality through a memory-guided gated mechanism that selectively filters noise and misaligned information, enabling robust delineation of defect boundaries under degraded imaging conditions. (2) The CRC module tackles defect discrimination challenges arising from intraclass variance and interclass similarity through a dual-stage attention mechanism that models intra-modal and inter-modal dependencies. (3) The AWA module captures irregular defect geometries via bidirectional rectangular windows that expand receptive fields along different orientations, addressing the constraints of conventional square window attention.
Comprehensive experiments on the ATBCD-Seg and NEU-Seg datasets demonstrate that SFCF-Net exhibits promising performance across multiple evaluation metrics, consistently outperforming existing state-of-the-art methods and meeting real-world needs for automated quality control in blade manufacturing.
Future work will focus on several directions to further advance practical applicability. (1) Developing weakly supervised and semi-supervised learning approaches to reduce the dependency on extensive pixel-level annotations, which are extremely time-consuming and labor-intensive to produce for radiographic inspection. (2) Exploring zero-shot and open-set learning methodologies to enable the detection of previously unseen defect categories without requiring additional labeled training data, thereby enhancing the model’s adaptability to emerging defect patterns in production environments. (3) Investigating large foundation models to leverage their powerful representation capabilities for improved generalization across diverse casting conditions and defect manifestations. Through these efforts, we aim to further advance aircraft engine blade defect detection technology, ensuring aviation safety and the efficient operation of equipment.