3.1. Adaptive Dynamic Multi-Scale Perception Network (ADMPNet)
The standard ResNet [38] backbone has limitations when analyzing wind turbine surface defects. First, its fixed-size convolution kernels cannot handle the wide range of scales at which surface defect features appear. Second, ResNet lacks the ability to adaptively select features for residual connections, so it loses valuable transferable features when processing complex backgrounds. Finally, ResNet does not employ multi-channel fusion techniques, preventing it from exploiting information from multiple receptive fields for the comprehensive detection of smaller surface defects.
Therefore, this paper proposes a new network called the Adaptive Dynamic Multi-scale Perception Network (ADMPNet). It contains a hierarchical feature extraction structure with dynamic feature weighting and three dedicated modules: the Adaptive Dynamic Mixed Block (ADMBlock), Multi-Dimensional Information Fusion (MDIFusion), and the Adaptive Gated Feature Unit (AGFUnit). ADMPNet can model and quantify the complex spatial–channel dependencies in multi-scale defect features, providing strong performance on wind turbine surface defects in challenging environments. The ADMPNet architecture is shown in Figure 2.
The ADMPNet backbone network is built on progressive multi-scale feature learning concepts. The backbone incorporates adaptive, hierarchical feature extraction capabilities through a structured abstraction process. It first extracts coarse-grained features and then progressively refines feature levels by introducing the ADMBlock within each downsampling layer. These ADMBlock modules contribute localized texture details that enhance the global semantic representation. The mathematical feature learning process of ADMPNet can be expressed as
where L denotes the number of hierarchical levels of the network, F_i the feature tensor at layer i, D_i the downsampling transformation at layer i, T(·) the nonlinear feature transformation of the ADMBlock, A(·) the adaptive feature aggregation operation, E(·) the global context encoder, G the dynamic weight generation function, ⨁ the multi-level feature fusion operator, and ⊙ an element-wise (Hadamard) product operation.
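The hierarchical learning process described above can be sketched as a simple loop. In the NumPy sketch below, the concrete operators are illustrative stand-ins and not the paper's learned transforms: average pooling for the downsampling transformation, a ReLU residual for the ADMBlock nonlinearity, and sigmoid channel weights from global average pooling for the dynamic weight generation.

```python
import numpy as np

def downsample(x):
    """Stand-in for the downsampling transform: 2x2 average pooling."""
    h, w, c = x.shape
    return x[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def admblock(x):
    """Stand-in for the ADMBlock nonlinearity: a small ReLU residual refinement."""
    return x + 0.1 * np.maximum(x, 0.0)

def dynamic_weights(x):
    """Stand-in for dynamic weight generation: sigmoid of per-channel global context."""
    g = x.mean(axis=(0, 1))              # global average pooling per channel
    return 1.0 / (1.0 + np.exp(-g))      # channel weights in (0, 1)

def backbone(x, levels=3):
    """Progressive multi-scale extraction: downsample, refine, reweight per level."""
    feats = []
    for _ in range(levels):
        x = admblock(downsample(x))
        x = x * dynamic_weights(x)       # element-wise (Hadamard) modulation
        feats.append(x)
    return feats

feats = backbone(np.random.rand(32, 32, 8))
```

Each level halves the spatial resolution while keeping the channel count, so a 32 × 32 × 8 input yields feature maps of 16 × 16, 8 × 8, and 4 × 4.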
The ADMBlock adopts a dual-branch residual learning architecture that integrates an adaptive scaling mechanism, achieving unified multi-scale spatial feature extraction and inter-channel information interaction. The ADMBlock structure is shown in Figure 3. This module decomposes the complex feature learning task into two collaborative sub-processes: MDIFusion first performs multi-scale spatial dependency modeling, and the AGFUnit then achieves nonlinear feature enhancement in the channel dimension. Each sub-process is equipped with an independent batch normalization layer for feature standardization. MDIFusion achieves adaptive fusion of multi-scale features through a multi-branch parallel convolution structure, with the mathematical expression
where K denotes the number of multi-scale branches and S_k specifies the channel-splitting strategy for the k-th branch. The AGFUnit adopts a gated linear unit mechanism to achieve efficient inter-channel feature interaction, with the calculation process expressed as
where Split(·) denotes the channel-splitting operation according to strategy S, Conv1×1(·) a 1 × 1 convolution transformation, T_in, T_out, DWConv, and T_gate the input transformation, output transformation, depthwise convolution, and gate transformation functions, respectively, and ⊗ an element-wise gating operation.
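The gated linear unit idea behind the AGFUnit can be illustrated with dense channel-mixing matrices standing in for the learned transformations (the depthwise convolution and channel split are omitted for brevity; all names here are illustrative):

```python
import numpy as np

def agfunit(x, w_in, w_gate, w_out):
    """Gated linear unit over channels (sketch of the AGFUnit idea).

    x: (N, C) features; w_in / w_gate / w_out: (C, C) channel-mixing matrices
    standing in for the input, gate, and output transformations.
    """
    value = x @ w_in                                  # input transformation
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))        # gate transformation + sigmoid
    return (value * gate) @ w_out                     # element-wise gating, then output
```

With identity weight matrices this reduces to x · sigmoid(x), making the gating behavior easy to verify.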
AIDConvolution constructs a heterogeneous multi-branch depthwise separable convolution architecture that adaptively extracts and fuses features from different spatial directions and scales. The module comprises three parallel branches with different geometric receptive fields: a square convolution branch (k × k) that captures complete spatial context in local regions, a horizontal strip convolution branch (1 × k) that models linear structures in the horizontal direction, and a vertical strip convolution branch (k × 1) that extracts spatial dependencies in the vertical direction. The output of each branch undergoes adaptive weighted fusion through a dynamic weight allocation mechanism driven by global statistical information. AIDConvolution is detailed in Equations (A1) and (A2) in Appendix A.
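The three-branch design and its statistics-driven fusion can be sketched as follows; averaging filters stand in for the learned depthwise kernels, and a softmax over global means stands in for the paper's dynamic weight allocation (both are assumptions for illustration):

```python
import numpy as np

def box_filter(x, kh, kw):
    """Depthwise 'same' filter with a kh x kw averaging kernel (single channel)."""
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (kh * kw)

def aid_convolution(x):
    """Three geometric receptive fields, fused by weights from global statistics."""
    branches = [box_filter(x, 3, 3),   # square kernel: local spatial context
                box_filter(x, 1, 3),   # horizontal strip: horizontal structures
                box_filter(x, 3, 1)]   # vertical strip: vertical structures
    stats = np.array([b.mean() for b in branches])   # global statistical descriptor
    w = np.exp(stats) / np.exp(stats).sum()          # softmax branch weights
    return sum(wi * b for wi, b in zip(w, branches))
```

Because the branch weights sum to one, a constant input passes through unchanged, which makes the fusion easy to sanity-check.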
The ADMPNet backbone network enhances multi-scale defect feature extraction using a dual-branch residual architecture and dynamic weight mechanism implemented in the ADMBlock. This architecture integrates MDIFusion and AIDConvolution to achieve adaptive fusion of heterogeneous multi-branch features. The AGFUnit enhances the inter-channel information interaction capability through gating mechanisms. ADMPNet improves the feature representation capability while ensuring training stability, providing high-quality multi-scale semantic features for the RT-DETR detection head.
3.2. Hierarchical Dynamic Feature Pyramid Network (HDFPN)
The traditional FPN has three main limitations for wind turbine surface defect detection. First, linear-interpolation-based upsampling causes semantic conflicts when fusing multi-scale features. Second, unidirectional top-down propagation ignores global contextual relationships across different scales. Third, fixed-weight fusion strategies cannot adapt to varying feature importance. To address these issues, this paper proposes a Hierarchical Dynamic Feature Pyramid Network (HDFPN). It achieves hierarchical modeling and precise integration of multi-scale defect features through cross-scale global context representation, recursive channel–spatial collaboration mechanisms, and dynamic adaptive fusion strategies, improving detection accuracy for complex defects such as tiny cracks and edge corrosion.
The HDFPN adopts a hierarchical processing architecture of global context aggregation, recursive feature enhancement, and dynamic adaptive fusion. It achieves precise modeling and integration of multi-scale defect features through four core modules. As shown in
Figure 4, the network starts with three pyramid-level features, P3, P4, and P5, from the backbone network. The Pyramid Adaptive Context Extraction (PACE) module first constructs cross-scale global context representation. It uses adaptive spatial alignment to unify different-resolution features and employs recursive channel attention to capture cross-scale semantic dependencies, alleviating semantic heterogeneity among multi-scale features. The Global Relationship Module (GRM) optimizes context features at each pyramid level. It suppresses background noise and enhances defect region identification through channel–spatial decoupling and iterative anisotropic attention. The AMDF module performs proportionate fusion based on high-level feature semantics. It creates spatial attention masks to determine the relative importance of detail-level features when fusing with semantic-level features, enhancing both semantic information and detail texture in defect areas. The Dynamic Adaptive Interpolation Fusion achieves content-adaptive fusion of cross-level features through learnable convolution parameters, overcoming traditional fixed-weight interpolation limitations. The recursive feature propagation and fusion process can be formally expressed as
where P(·) denotes the pyramid context extraction operator, R(·) the recursive calibration transformation, I(·) the dynamic interpolation fusion operator, F(·) the dynamic fusion operator, Up(·) bilinear upsampling, C_l the context feature at layer l, φ_l a feature projection operation at layer l, and Concat(P3, P4, P5) the concatenation of feature maps from pyramid levels 3, 4, and 5. This recursive framework achieves the progressive transfer of high-level semantics and the hierarchical integration of multi-scale features.
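The recursive top-down propagation can be sketched minimally as follows; nearest-neighbour upsampling and residual addition are simplified stand-ins for the learned dynamic interpolation and fusion operators:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hdfpn_topdown(p3, p4, p5):
    """Top-down recursion: each level absorbs upsampled higher-level semantics."""
    n4 = p4 + upsample2x(p5)     # fuse P5 semantics into P4
    n3 = p3 + upsample2x(n4)     # fuse the refined P4 into P3
    return n3, n4, p5
```

Each output keeps its own pyramid resolution while accumulating semantics from the levels above it.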
PACE alleviates the semantic gap among multi-scale features by constructing a cross-scale global context representation. The module operates in two main steps. First, it employs pyramid pooling aggregation to spatially align the input multi-scale features. Second, it introduces recursive channel attention for n iterations of feature refinement. Each iteration uses depthwise separable convolution to extract local spatial patterns, combines horizontal and vertical global pooling to capture anisotropic spatial features, and performs channel excitation through two-stage strip convolution. The pyramid context extraction process is represented as
where Pool_s(·) denotes adaptive pooling to size s, ⨁ an element-wise addition operation, s_min the minimum spatial size used for feature alignment, R^n(·) n iterations of recursive calibration (n is set to 3 in our implementation), and Split(·) the operation that splits features by specified channel numbers.
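The two PACE steps, spatial alignment followed by n = 3 rounds of calibration, can be sketched as below; the calibration step is a deliberately simplified gated residual standing in for the paper's depthwise-conv and strip-conv excitation:

```python
import numpy as np

def adaptive_pool(x, s):
    """Average-pool a square map to s x s (assumes sizes divide evenly)."""
    f = x.shape[0] // s
    return x.reshape(s, f, s, f).mean(axis=(1, 3))

def pace(features, n_iter=3):
    """Align multi-scale maps to the smallest size, sum, then refine n times."""
    s_min = min(f.shape[0] for f in features)
    ctx = sum(adaptive_pool(f, s_min) for f in features)  # element-wise aggregation
    for _ in range(n_iter):                               # recursive calibration
        gate = 1.0 / (1.0 + np.exp(-ctx.mean()))          # global statistic -> gate
        ctx = ctx + gate * np.tanh(ctx)                   # simplified excitation step
    return ctx
```

The aligned context always takes the spatial size of the smallest pyramid level, so heterogeneous inputs produce one coherent map.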
The iterative process of the GRM integrates depthwise convolution, anisotropic pooling, and strip convolution excitation, with the structure shown in Figure 5. The workflow can be expressed as follows:
where DWConv(·) is a depthwise separable convolution extracting local spatial features, P_h(·) and P_v(·) represent horizontal and vertical global average pooling, respectively, ⊕ is element-wise addition, SConv(·) represents the cascaded strip convolution sequence used for channel excitation, σ(·) is the sigmoid activation function, FFN(·) represents the feedforward network (including a BN layer and an MLP), γ is a learnable channel-level scaling parameter, λ is a learnable scaling parameter, and ⊙ is the Hadamard product.
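One GRM-style refinement step can be sketched as follows; the strip-convolution excitation is collapsed into a direct sigmoid gate for brevity, and the fixed scale stands in for the learnable scaling parameters:

```python
import numpy as np

def grm_step(x, gamma=0.1):
    """One refinement pass: anisotropic pooling -> gate -> scaled gated residual."""
    ph = x.mean(axis=1, keepdims=True)        # horizontal global average pooling
    pv = x.mean(axis=0, keepdims=True)        # vertical global average pooling
    a = ph + pv                               # element-wise addition (broadcasts)
    gate = 1.0 / (1.0 + np.exp(-a))           # sigmoid attention map
    return x + gamma * (gate * x)             # gated residual with learnable-style scale
```

Because the update is a gated residual on x itself, a zero feature map passes through unchanged, and the output always preserves the input shape.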
The Adaptive Multi-scale Fusion Block designs an asymmetric feature fusion mechanism based on high-level semantic guidance: high-level features serve as spatial attention sources that adaptively weight low-level features. The structure is shown in Figure 6. The module receives high-resolution low-level features F_l and low-resolution high-level features F_h. Channel mapping is performed on both inputs through two 1 × 1 convolutions. Hard-sigmoid activation is then applied to the high-level features to generate normalized spatial weight masks. The masks are upsampled to the low-level feature resolution through bilinear interpolation. Finally, semantic-guided detail preservation is achieved through element-wise multiplication. The fusion process is expressed as
where W_1 and W_2 are 1 × 1 convolution kernels, HS(·) is the hard-sigmoid function, Up_s(·) is bilinear upsampling to size s, and ⊙ is the Hadamard product. This asymmetric strategy effectively suppresses low-level feature noise while preserving key details through high-level semantic guidance.
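The mask-then-multiply fusion can be sketched directly; the 1 × 1 channel mappings are omitted and nearest-neighbour upsampling stands in for bilinear interpolation:

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation, clipped to [0, 1]."""
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def amdf(low, high):
    """High-level features gate low-level detail via an upsampled attention mask."""
    mask = upsample2x(hard_sigmoid(high))     # normalized spatial weight mask
    return low * mask                         # semantic-guided detail preservation
```

A zero-valued high-level map yields a uniform mask of 0.5, so the low-level features are attenuated but not erased.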
Dynamic Adaptive Interpolation Fusion achieves adaptive fusion of cross-level features through learnable convolution parameters, overcoming the limitations of fixed-weight fusion. The structure is shown in Figure 7. The module receives two features, F_l and F_h, from adjacent levels. First, channel transformation and semantic mapping are performed on the high-level features using a 1 × 1 convolution. The result is then upsampled to the spatial scale of the low-level features through bilinear interpolation. Finally, fusion is performed in residual form. The dynamic fusion process is expressed as
where Conv1×1(·) is a parameterized 1 × 1 convolution transformation, and Up_s(·) is bilinear interpolation to size s.
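The residual form of this fusion is short enough to sketch in full; a scalar weight stands in for the learnable 1 × 1 convolution, and nearest-neighbour upsampling for bilinear interpolation:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour stand-in for bilinear interpolation."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def dynamic_fusion(low, high, w=1.0):
    """Residual fusion: project high-level features (w stands in for the
    learnable 1x1 convolution), upsample, and add to the low-level map."""
    return low + upsample2x(w * high)
```

The residual form guarantees that low-level detail is never overwritten, only augmented by the projected high-level semantics.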
Through the collaborative integration of PACE, the GRM, the Adaptive Multi-scale Fusion Block, and Dynamic Adaptive Interpolation Fusion, the HDFPN effectively addresses the traditional FPN’s limitations in wind turbine defect detection, improving the handling of semantic conflicts, the preservation of gradient information, and the adaptation to diverse defect characteristics. The HDFPN achieves robust feature representation for complex defect patterns through cross-scale global context modeling and recursive feature enhancement, while its dynamic adaptive fusion strategy automatically adjusts feature importance based on content characteristics, enhancing detection accuracy for challenging defects such as micro-cracks and surface anomalies. Compared to conventional fixed-weight fusion methods, the HDFPN provides improved multi-scale feature integration while maintaining computational efficiency.
3.3. Dynamic Frequency-Domain Feature Encoder (DFDEncoder)
The AIFI encoder has several limitations in detecting wind turbine blade surface defects. First, it pays insufficient attention to small or subtle defects when other interference is present in industrial environments. Second, its feedforward network only performs feature transformation in the spatial domain and cannot capture the frequency-domain characteristics of surface defects. To address these two issues, this paper proposes a DFDEncoder, which achieves frequency-domain feature enhancement by integrating Frequency-Domain Feedforward Networks (FDFNetworks) and Dual-path Adaptive Feature Extractor (DAFE) architectures, and employs the Dynamic Tanh [39] process. This improves the accuracy and robustness of wind turbine surface defect detection. The DFDEncoder structure is shown in Figure 8.
The DFDEncoder fully extracts and encodes defect features by integrating three technical modules. The encoder employs a dual-path design based on residual connections. The first path contains a multi-head self-attention mechanism and Dynamic Tanh normalization to process the global context, with a DAFE module providing enhanced multi-scale feature extraction. The second path processes the data in the frequency domain, using FDFNetworks to capture the defects’ spectral characteristics. The two paths are combined through residual connections into a single output, which undergoes final refinement through another DAFE module and Dynamic Tanh normalization. The overall encoding can be expressed as follows:
where X represents the input feature tensor, MSA(·) a multi-head self-attention operation, FDFN(·) the frequency-domain feedforward network, DAFE_1(·) and DAFE_2(·) the two dual-path adaptive feature extractors, and DyT(·) a Dynamic Tanh operation. FDFNetworks extends traditional spatial-domain convolution operations to the frequency domain, identifying defect spectral feature patterns through the fast Fourier transform. The module workflow consists of five primary steps. First, a 1 × 1 convolution is performed to expand the channels and change the input feature dimensions. Second, dilated depthwise separable convolution is used to learn local spatial feature patterns while maintaining an increased receptive field with reduced parameters. Third, adaptive padding is applied so that the features conform to the FFT algorithm. Fourth, spectral-domain enhancement is conducted in the frequency domain using learnable frequency-domain weights W and biases B, enabling adaptive modulation of the various frequency components. Finally, the inverse fast Fourier transform is performed, followed by a SiLU activation function and gating mechanism. FDFNetworks, detailed in Equations (A3) and (A4) in Appendix A, thereby improves the perception capability for edge and textural features.
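The core spectral step, FFT, learnable reweighting, inverse FFT, SiLU, can be sketched in NumPy. The shapes of the weights and biases are an assumption here; the paper's exact formulation is given in its Appendix A:

```python
import numpy as np

def fdf_modulate(x, w, b):
    """Frequency-domain enhancement: FFT -> learnable reweighting -> inverse FFT.

    x: real 2-D feature map; w, b: learnable frequency-domain weights/biases
    broadcastable to the spectrum's shape (illustrative assumption).
    """
    spec = np.fft.rfft2(x)                 # to the frequency domain
    spec = spec * w + b                    # adaptive modulation of frequency components
    y = np.fft.irfft2(spec, s=x.shape)     # back to the spatial domain
    return y / (1.0 + np.exp(-y))          # SiLU gating: y * sigmoid(y)
```

With unit weights and zero biases the spectral step is an identity, so the module reduces to a plain SiLU, a convenient correctness check.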
DAFE is a lightweight multi-scale feature adapter constructed from parallel multi-scale depthwise convolution operations and a dual residual connection mechanism. The outer residual connection ensures stable feature transfer, while the inner residual connection achieves deep fusion of multi-scale features through the hierarchical multi-scale sub-module (DAFE-HMS). DAFE first performs feature normalization through LayerNorm2d. It then uses two learnable scaling factors, α and β (with β initialized to 1), to weight and combine the normalized and original features, ensuring training stability in the early stages. After dimensionality reduction through a 1 × 1 convolution, the features are sent to the DAFE-HMS module for multi-scale processing. This sub-module extracts multi-scale features through three parallel depthwise separable convolutions of different sizes (3 × 3, 5 × 5, and 7 × 7) and then fuses them through average pooling. DAFE and DAFE-HMS are detailed in Equations (A5) and (A6) in Appendix A, respectively.
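The dual-residual, multi-branch layout can be sketched as below. Averaging filters stand in for the 3 × 3, 5 × 5, and 7 × 7 depthwise convolutions, a global normalization for LayerNorm2d, and the default α = 0, β = 1 blend is an assumed initialization for illustration:

```python
import numpy as np

def box_filter(x, kh, kw):
    """Depthwise 'same' averaging filter (stand-in for a learned depthwise conv)."""
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (kh * kw)

def dafe(x, alpha=0.0, beta=1.0):
    """Dual-residual multi-scale adapter (sketch)."""
    norm = (x - x.mean()) / (x.std() + 1e-6)          # LayerNorm stand-in
    h = alpha * norm + beta * x                        # stability-weighted blend
    ms = (box_filter(h, 3, 3) + box_filter(h, 5, 5)    # three parallel scales,
          + box_filter(h, 7, 7)) / 3.0                 # fused by averaging
    inner = h + ms                                     # inner residual: deep fusion
    return x + inner                                   # outer residual: stable transfer
```

With the assumed initialization, a constant input of ones flows through as 1 + (1 + 1) = 3 everywhere, which exercises both residual paths at once.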
Compared to the traditional AIFI encoder, the DFDEncoder achieves frequency-domain enhancement through FDFNetworks and, through its dual-path adaptive feature extractors, improves the model’s ability to accurately extract texture and edge components from visual images. Finally, by introducing Dynamic Tanh, the DFDEncoder enhances adaptability across a wide range of industrial environments. The DFDEncoder therefore achieves gains in detecting wind turbine surface defects by combining FDFNetworks and DAFE.