3.1. Adaptive Dynamic Multi-Scale Perception Network (ADMPNet)
The standard ResNet [38] backbone has limitations when analyzing wind turbine surface defects. First, its fixed-size convolution kernels cannot handle the wide range of scales at which surface defect features appear. Second, ResNet lacks the ability to adaptively select features for residual connections, so it loses valuable transferable features when processing complex backgrounds. Finally, ResNet does not employ multi-channel fusion techniques, preventing it from exploiting information from multiple receptive fields for the comprehensive detection of smaller surface defects.
Therefore, this paper proposes a new network called the Adaptive Dynamic Multi-scale Perception Network (ADMPNet). It contains a hierarchical feature extraction structure with dynamic feature weighting and three dedicated modules: the Adaptive Dynamic Mixed Block (ADMBlock), Multi-Dimensional Information Fusion (MDIFusion), and the Adaptive Gated Feature Unit (AGFUnit). ADMPNet can model and quantify the complex spatial–channel dependencies in multi-scale defect features, providing strong performance on wind turbine surface defects in challenging environments. The ADMPNet architecture is shown in Figure 2.
The ADMPNet backbone network is built on progressive multi-scale feature learning concepts. The backbone incorporates adaptive, hierarchical feature extraction capabilities through a structured abstraction process. It first extracts coarse-grained features and then progressively refines feature levels by introducing the ADMBlock within each downsampling layer. These ADMBlock modules contribute localized texture details that enhance the global semantic representation. The mathematical feature learning process of ADMPNet can be expressed as
where L denotes the number of hierarchical levels of the network, F_i the feature tensor at layer i, D_i the downsampling transformation at layer i, T(·) the nonlinear feature transformation of the ADMBlock, A(·) the adaptive feature aggregation operation, E(·) the global context encoder, G the dynamic weight generation function, ⨁ the multi-level feature fusion operator, and ⊙ an element-wise (Hadamard) product operation.
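The hierarchical learning process described above can be sketched as a simple loop. In the NumPy sketch below, the concrete operators are illustrative stand-ins and not the paper's learned transforms: average pooling for the downsampling transformation, a ReLU residual for the ADMBlock nonlinearity, and sigmoid channel weights from global average pooling for the dynamic weight generation.

```python
import numpy as np

def downsample(x):
    """Stand-in for the downsampling transform: 2x2 average pooling."""
    h, w, c = x.shape
    return x[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def admblock(x):
    """Stand-in for the ADMBlock nonlinearity: a small ReLU residual refinement."""
    return x + 0.1 * np.maximum(x, 0.0)

def dynamic_weights(x):
    """Stand-in for dynamic weight generation: sigmoid of per-channel global context."""
    g = x.mean(axis=(0, 1))              # global average pooling per channel
    return 1.0 / (1.0 + np.exp(-g))      # channel weights in (0, 1)

def backbone(x, levels=3):
    """Progressive multi-scale extraction: downsample, refine, reweight per level."""
    feats = []
    for _ in range(levels):
        x = admblock(downsample(x))
        x = x * dynamic_weights(x)       # element-wise (Hadamard) modulation
        feats.append(x)
    return feats

feats = backbone(np.random.rand(32, 32, 8))
```

Each level halves the spatial resolution while keeping the channel count, so a 32 × 32 × 8 input yields feature maps of 16 × 16, 8 × 8, and 4 × 4.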
The ADMBlock adopts a dual-branch residual learning architecture that integrates an adaptive scaling mechanism, achieving unified multi-scale spatial feature extraction and inter-channel information interaction. The ADMBlock structure is shown in Figure 3. This module decomposes the complex feature learning task into two collaborative sub-processes: MDIFusion first performs multi-scale spatial dependency modeling, and the AGFUnit then achieves nonlinear feature enhancement in the channel dimension. Each sub-process is equipped with an independent batch normalization layer for feature standardization. MDIFusion achieves adaptive fusion of multi-scale features through a multi-branch parallel convolution structure, with the mathematical expression
where K denotes the number of multi-scale branches and S_k specifies the channel-splitting strategy for the k-th branch. The AGFUnit adopts a gated linear unit mechanism to achieve efficient inter-channel feature interaction, with the calculation process expressed as
where Split(·) denotes the channel-splitting operation according to strategy S, Conv1×1(·) a 1 × 1 convolution transformation, T_in, T_out, DWConv, and T_gate the input transformation, output transformation, depthwise convolution, and gate transformation functions, respectively, and ⊗ an element-wise gating operation.
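The gated linear unit idea behind the AGFUnit can be illustrated with dense channel-mixing matrices standing in for the learned transformations (the depthwise convolution and channel split are omitted for brevity; all names here are illustrative):

```python
import numpy as np

def agfunit(x, w_in, w_gate, w_out):
    """Gated linear unit over channels (sketch of the AGFUnit idea).

    x: (N, C) features; w_in / w_gate / w_out: (C, C) channel-mixing matrices
    standing in for the input, gate, and output transformations.
    """
    value = x @ w_in                                  # input transformation
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))        # gate transformation + sigmoid
    return (value * gate) @ w_out                     # element-wise gating, then output
```

With identity weight matrices this reduces to x · sigmoid(x), making the gating behavior easy to verify.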
AIDConvolution constructs a heterogeneous multi-branch depthwise separable convolution architecture that adaptively extracts and fuses features from different spatial directions and scales. The module comprises three parallel branches with different geometric receptive fields: a square convolution branch (k × k) that captures complete spatial context in local regions, a horizontal strip convolution branch (1 × k) that models linear structures in the horizontal direction, and a vertical strip convolution branch (k × 1) that extracts spatial dependencies in the vertical direction. The output of each branch undergoes adaptive weighted fusion through a dynamic weight allocation mechanism driven by global statistical information. AIDConvolution is detailed in Equations (A1) and (A2) in Appendix A.
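The three-branch design and its statistics-driven fusion can be sketched as follows; averaging filters stand in for the learned depthwise kernels, and a softmax over global means stands in for the paper's dynamic weight allocation (both are assumptions for illustration):

```python
import numpy as np

def box_filter(x, kh, kw):
    """Depthwise 'same' filter with a kh x kw averaging kernel (single channel)."""
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (kh * kw)

def aid_convolution(x):
    """Three geometric receptive fields, fused by weights from global statistics."""
    branches = [box_filter(x, 3, 3),   # square kernel: local spatial context
                box_filter(x, 1, 3),   # horizontal strip: horizontal structures
                box_filter(x, 3, 1)]   # vertical strip: vertical structures
    stats = np.array([b.mean() for b in branches])   # global statistical descriptor
    w = np.exp(stats) / np.exp(stats).sum()          # softmax branch weights
    return sum(wi * b for wi, b in zip(w, branches))
```

Because the branch weights sum to one, a constant input passes through unchanged, which makes the fusion easy to sanity-check.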
The ADMPNet backbone network enhances multi-scale defect feature extraction using a dual-branch residual architecture and dynamic weight mechanism implemented in the ADMBlock. This architecture integrates MDIFusion and AIDConvolution to achieve adaptive fusion of heterogeneous multi-branch features. The AGFUnit enhances the inter-channel information interaction capability through gating mechanisms. ADMPNet improves the feature representation capability while ensuring training stability, providing high-quality multi-scale semantic features for the RT-DETR detection head.
3.2. Hierarchical Dynamic Feature Pyramid Network (HDFPN)
The traditional FPN has three main limitations for wind turbine surface defect detection. First, linear-interpolation-based upsampling causes semantic conflicts when fusing multi-scale features. Second, unidirectional top-down propagation ignores global contextual relationships across different scales. Third, fixed-weight fusion strategies cannot adapt to varying feature importance. To address these issues, this paper proposes a Hierarchical Dynamic Feature Pyramid Network (HDFPN). It achieves hierarchical modeling and precise integration of multi-scale defect features through cross-scale global context representation, recursive channel–spatial collaboration mechanisms, and dynamic adaptive fusion strategies, improving detection accuracy for complex defects such as tiny cracks and edge corrosion.
The HDFPN adopts a hierarchical processing architecture of global context aggregation, recursive feature enhancement, and dynamic adaptive fusion. It achieves precise modeling and integration of multi-scale defect features through four core modules. As shown in
Figure 4, the network starts with three pyramid-level features, P3, P4, and P5, from the backbone network. The Pyramid Adaptive Context Extraction (PACE) module first constructs cross-scale global context representation. It uses adaptive spatial alignment to unify different-resolution features and employs recursive channel attention to capture cross-scale semantic dependencies, alleviating semantic heterogeneity among multi-scale features. The Global Relationship Module (GRM) optimizes context features at each pyramid level. It suppresses background noise and enhances defect region identification through channel–spatial decoupling and iterative anisotropic attention. The AMDF module performs proportionate fusion based on high-level feature semantics. It creates spatial attention masks to determine the relative importance of detail-level features when fusing with semantic-level features, enhancing both semantic information and detail texture in defect areas. The Dynamic Adaptive Interpolation Fusion achieves content-adaptive fusion of cross-level features through learnable convolution parameters, overcoming traditional fixed-weight interpolation limitations. The recursive feature propagation and fusion process can be formally expressed as
where P(·) denotes the pyramid context extraction operator, R(·) the recursive calibration transformation, I(·) the dynamic interpolation fusion operator, F(·) the dynamic fusion operator, Up(·) bilinear upsampling, C_l the context feature at layer l, φ_l a feature projection operation at layer l, and Concat(P3, P4, P5) the concatenation of feature maps from pyramid levels 3, 4, and 5. This recursive framework achieves the progressive transfer of high-level semantics and the hierarchical integration of multi-scale features.
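The recursive top-down propagation can be sketched minimally as follows; nearest-neighbour upsampling and residual addition are simplified stand-ins for the learned dynamic interpolation and fusion operators:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hdfpn_topdown(p3, p4, p5):
    """Top-down recursion: each level absorbs upsampled higher-level semantics."""
    n4 = p4 + upsample2x(p5)     # fuse P5 semantics into P4
    n3 = p3 + upsample2x(n4)     # fuse the refined P4 into P3
    return n3, n4, p5
```

Each output keeps its own pyramid resolution while accumulating semantics from the levels above it.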
PACE alleviates the semantic gap among multi-scale features by constructing a cross-scale global context representation. The module operates in two main steps. First, it employs pyramid pooling aggregation to spatially align the input multi-scale features. Second, it introduces recursive channel attention for n iterations of feature refinement. Each iteration uses depthwise separable convolution to extract local spatial patterns, combines horizontal and vertical global pooling to capture anisotropic spatial features, and performs channel excitation through two-stage strip convolution. The pyramid context extraction process is represented as
where Pool_s(·) denotes adaptive pooling to size s, ⨁ an element-wise addition operation, s_min the minimum spatial size used for feature alignment, R^n(·) n iterations of recursive calibration (n is set to 3 in our implementation), and Split(·) the operation that splits features by specified channel numbers.
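The two PACE steps, spatial alignment followed by n = 3 rounds of calibration, can be sketched as below; the calibration step is a deliberately simplified gated residual standing in for the paper's depthwise-conv and strip-conv excitation:

```python
import numpy as np

def adaptive_pool(x, s):
    """Average-pool a square map to s x s (assumes sizes divide evenly)."""
    f = x.shape[0] // s
    return x.reshape(s, f, s, f).mean(axis=(1, 3))

def pace(features, n_iter=3):
    """Align multi-scale maps to the smallest size, sum, then refine n times."""
    s_min = min(f.shape[0] for f in features)
    ctx = sum(adaptive_pool(f, s_min) for f in features)  # element-wise aggregation
    for _ in range(n_iter):                               # recursive calibration
        gate = 1.0 / (1.0 + np.exp(-ctx.mean()))          # global statistic -> gate
        ctx = ctx + gate * np.tanh(ctx)                   # simplified excitation step
    return ctx
```

The aligned context always takes the spatial size of the smallest pyramid level, so heterogeneous inputs produce one coherent map.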
The iterative process of the GRM integrates depthwise convolution, anisotropic pooling, and strip convolution excitation, with the structure shown in Figure 5. The workflow can be expressed as follows:
where DWConv(·) is a depthwise separable convolution extracting local spatial features, P_h(·) and P_v(·) represent horizontal and vertical global average pooling, respectively, ⊕ is element-wise addition, SConv(·) represents the cascaded strip convolution sequence used for channel excitation, σ(·) is the sigmoid activation function, FFN(·) represents the feedforward network (including a BN layer and an MLP), γ is a learnable channel-level scaling parameter, λ is a learnable scaling parameter, and ⊙ is the Hadamard product.
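One GRM-style refinement step can be sketched as follows; the strip-convolution excitation is collapsed into a direct sigmoid gate for brevity, and the fixed scale stands in for the learnable scaling parameters:

```python
import numpy as np

def grm_step(x, gamma=0.1):
    """One refinement pass: anisotropic pooling -> gate -> scaled gated residual."""
    ph = x.mean(axis=1, keepdims=True)        # horizontal global average pooling
    pv = x.mean(axis=0, keepdims=True)        # vertical global average pooling
    a = ph + pv                               # element-wise addition (broadcasts)
    gate = 1.0 / (1.0 + np.exp(-a))           # sigmoid attention map
    return x + gamma * (gate * x)             # gated residual with learnable-style scale
```

Because the update is a gated residual on x itself, a zero feature map passes through unchanged, and the output always preserves the input shape.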
The Adaptive Multi-scale Fusion Block designs an asymmetric feature fusion mechanism based on high-level semantic guidance: high-level features serve as spatial attention sources that adaptively weight low-level features. The structure is shown in Figure 6. The module receives high-resolution low-level features F_l and low-resolution high-level features F_h. Channel mapping is performed on both inputs through two 1 × 1 convolutions. Hard-sigmoid activation is then applied to the high-level features to generate normalized spatial weight masks. The masks are upsampled to the low-level feature resolution through bilinear interpolation. Finally, semantic-guided detail preservation is achieved through element-wise multiplication. The fusion process is expressed as
where W_1 and W_2 are 1 × 1 convolution kernels, HS(·) is the hard-sigmoid function, Up_s(·) is bilinear upsampling to size s, and ⊙ is the Hadamard product. This asymmetric strategy effectively suppresses low-level feature noise while preserving key details through high-level semantic guidance.
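The mask-then-multiply fusion can be sketched directly; the 1 × 1 channel mappings are omitted and nearest-neighbour upsampling stands in for bilinear interpolation:

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation, clipped to [0, 1]."""
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def amdf(low, high):
    """High-level features gate low-level detail via an upsampled attention mask."""
    mask = upsample2x(hard_sigmoid(high))     # normalized spatial weight mask
    return low * mask                         # semantic-guided detail preservation
```

A zero-valued high-level map yields a uniform mask of 0.5, so the low-level features are attenuated but not erased.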
Dynamic Adaptive Interpolation Fusion achieves adaptive fusion of cross-level features through learnable convolution parameters, overcoming the limitations of fixed-weight fusion. The structure is shown in Figure 7. The module receives two features, F_l and F_h, from adjacent levels. First, channel transformation and semantic mapping are performed on the high-level features using a 1 × 1 convolution. The result is then upsampled to the spatial scale of the low-level features through bilinear interpolation. Finally, fusion is performed in residual form. The dynamic fusion process is expressed as
where Conv1×1(·) is a parameterized 1 × 1 convolution transformation, and Up_s(·) is bilinear interpolation to size s.
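The residual form of this fusion is short enough to sketch in full; a scalar weight stands in for the learnable 1 × 1 convolution, and nearest-neighbour upsampling for bilinear interpolation:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear interpolation."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def dynamic_fusion(low, high, w=1.0):
    """Residual fusion: project high-level features (w stands in for the
    learnable 1x1 convolution), upsample, and add to the low-level map."""
    return low + upsample2x(w * high)
```

The residual form guarantees that low-level detail is never overwritten, only augmented by the projected high-level semantics.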
Through the collaborative integration of PACE, the GRM, the Adaptive Multi-scale Fusion Block, and Dynamic Adaptive Interpolation Fusion, the HDFPN effectively addresses the traditional FPN’s limitations in wind turbine defect detection, improving the handling of semantic conflicts, the preservation of gradient information, and the adaptation to diverse defect characteristics. The HDFPN achieves robust feature representation for complex defect patterns through cross-scale global context modeling and recursive feature enhancement, while its dynamic adaptive fusion strategy automatically adjusts feature importance based on content characteristics, enhancing detection accuracy for challenging defects such as micro-cracks and surface anomalies. Compared to conventional fixed-weight fusion methods, the HDFPN provides improved multi-scale feature integration while maintaining computational efficiency.
3.3. Dynamic Frequency-Domain Feature Encoder (DFDEncoder)
The AIFI encoder has several limitations in detecting wind turbine blade surface defects. First, it pays insufficient attention to small or subtle defects when other interference is present in industrial environments. Second, its feedforward network only performs feature transformation in the spatial domain and cannot capture the frequency-domain characteristics of surface defects. To address these two issues, this paper proposes a DFDEncoder, which achieves frequency-domain feature enhancement by integrating Frequency-Domain Feedforward Networks (FDFNetworks) and Dual-path Adaptive Feature Extractor (DAFE) architectures, and employs the Dynamic Tanh [39] process. This improves the accuracy and robustness of wind turbine surface defect detection. The DFDEncoder structure is shown in Figure 8.
The DFDEncoder fully extracts and encodes defect features by integrating three technical modules. The encoder employs a dual-path design based on residual connections. The first path contains a multi-head self-attention mechanism and Dynamic Tanh normalization to process the global context, with a DAFE module providing enhanced multi-scale feature extraction. The second path processes the data in the frequency domain, using FDFNetworks to capture the defects’ spectral characteristics. The two paths are combined through residual connections into a single output, which undergoes final refinement through another DAFE module and Dynamic Tanh normalization. The overall encoding can be expressed as follows:
where X represents the input feature tensor, MSA(·) a multi-head self-attention operation, FDFN(·) the frequency-domain feedforward network, DAFE_1(·) and DAFE_2(·) the two dual-path adaptive feature extractors, and DyT(·) a Dynamic Tanh operation. FDFNetworks extends traditional spatial-domain convolution operations to the frequency domain, identifying defect spectral feature patterns through the fast Fourier transform. The module workflow consists of five primary steps. First, a 1 × 1 convolution is performed to expand the channels and change the input feature dimensions. Second, dilated depthwise separable convolution is used to learn local spatial feature patterns while maintaining an increased receptive field with reduced parameters. Third, adaptive padding is applied so that the features conform to the FFT algorithm. Fourth, spectral-domain enhancement is conducted in the frequency domain using learnable frequency-domain weights W and biases B, enabling adaptive modulation of the various frequency components. Finally, the inverse fast Fourier transform is performed, followed by a SiLU activation function and gating mechanism. FDFNetworks, detailed in Equations (A3) and (A4) in Appendix A, thereby improves the perception capability for edge and textural features.
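The core spectral step, FFT, learnable reweighting, inverse FFT, SiLU, can be sketched in NumPy. The shapes of the weights and biases are an assumption here; the paper's exact formulation is given in its Appendix A:

```python
import numpy as np

def fdf_modulate(x, w, b):
    """Frequency-domain enhancement: FFT -> learnable reweighting -> inverse FFT.

    x: real 2-D feature map; w, b: learnable frequency-domain weights/biases
    broadcastable to the spectrum's shape (illustrative assumption).
    """
    spec = np.fft.rfft2(x)                 # to the frequency domain
    spec = spec * w + b                    # adaptive modulation of frequency components
    y = np.fft.irfft2(spec, s=x.shape)     # back to the spatial domain
    return y / (1.0 + np.exp(-y))          # SiLU gating: y * sigmoid(y)
```

With unit weights and zero biases the spectral step is an identity, so the module reduces to a plain SiLU, a convenient correctness check.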
DAFE is a lightweight multi-scale feature adapter constructed from parallel multi-scale depthwise convolution operations and a dual residual connection mechanism. The outer residual connection ensures stable feature transfer, while the inner residual connection achieves deep fusion of multi-scale features through the hierarchical multi-scale sub-module (DAFE-HMS). DAFE first performs feature normalization through LayerNorm2d. It then uses two learnable scaling factors, α and β (with β initialized to 1), to weight and combine the normalized and original features, ensuring training stability in the early stages. After dimensionality reduction through a 1 × 1 convolution, the features are sent to the DAFE-HMS module for multi-scale processing. This sub-module extracts multi-scale features through three parallel depthwise separable convolutions of different sizes (3 × 3, 5 × 5, and 7 × 7) and then fuses them through average pooling. DAFE and DAFE-HMS are detailed in Equations (A5) and (A6) in Appendix A, respectively.
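The dual-residual, multi-branch layout can be sketched as below. Averaging filters stand in for the 3 × 3, 5 × 5, and 7 × 7 depthwise convolutions, a global normalization for LayerNorm2d, and the default α = 0, β = 1 blend is an assumed initialization for illustration:

```python
import numpy as np

def box_filter(x, kh, kw):
    """Depthwise 'same' averaging filter (stand-in for a learned depthwise conv)."""
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (kh * kw)

def dafe(x, alpha=0.0, beta=1.0):
    """Dual-residual multi-scale adapter (sketch)."""
    norm = (x - x.mean()) / (x.std() + 1e-6)          # LayerNorm stand-in
    h = alpha * norm + beta * x                        # stability-weighted blend
    ms = (box_filter(h, 3, 3) + box_filter(h, 5, 5)    # three parallel scales,
          + box_filter(h, 7, 7)) / 3.0                 # fused by averaging
    inner = h + ms                                     # inner residual: deep fusion
    return x + inner                                   # outer residual: stable transfer
```

With the assumed initialization, a constant input of ones flows through as 1 + (1 + 1) = 3 everywhere, which exercises both residual paths at once.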
Compared to the traditional AIFI encoder, the DFDEncoder achieves frequency-domain enhancement through FDFNetworks and, through its dual-path adaptive feature extractors, improves the model’s ability to accurately extract texture and edge components from visual images. Finally, by introducing Dynamic Tanh, the DFDEncoder enhances adaptability across a wide range of industrial environments. The DFDEncoder therefore achieves gains in detecting wind turbine surface defects by combining FDFNetworks and DAFE.