Optimized RFE-YOLO Method for Identifying Defects in Wind Turbine Blades

Bai, Hua; Dong, Wei; Wu, Yanwei

doi:10.3390/app16105070

Open Access Article

by Hua Bai ¹,*, Wei Dong ¹,² and Yanwei Wu ¹,²

¹ Tianjin Key Laboratory of Optoelectronic Detection Technology and Systems, Tiangong University, Tianjin 300387, China

² School of Electrical Engineering, Tiangong University, Tianjin 300387, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5070; https://doi.org/10.3390/app16105070

Submission received: 24 April 2026 / Revised: 12 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026

(This article belongs to the Section Electrical, Electronics and Communications Engineering)


Abstract

Wind turbine blade defect detection requires accurate identification of small and irregular defects while maintaining low computational cost for practical inspection scenarios. However, lightweight detectors often suffer from insufficient local feature extraction, limited multiscale feature fusion, and weak responses to critical defect regions. To address these issues, this study proposes a Receptive-Field-Enhanced You Only Look Once model (RFE-YOLO), a lightweight defect detection model based on You Only Look Once version 10 nano (YOLOv10n). The proposed model introduces three task-oriented improvements. First, C2f-RFAConv is embedded into the backbone to enhance receptive-field-aware local feature representation for fine-grained defects. Second, a Compact Cross-scale Feature Fusion Module (CCFM) is designed in the neck to improve the integration of low-level detail information and high-level semantic features with reduced computational complexity. Third, an Efficient Local Attention (ELA) module is inserted before the detection head to strengthen defect-related spatial responses after feature fusion. Experiments were conducted on a wind turbine blade defect dataset containing three categories: Crack, Oil leakage, and Peel. The results show that RFE-YOLO achieves a mean Average Precision of 89.9% at an Intersection over Union threshold of 0.5 (mAP@0.5) and 64.73% mAP@0.5:0.95. Compared with YOLOv10n, RFE-YOLO improves mAP@0.5 by 2.8 percentage points while reducing the number of parameters from 2.70M to 1.91M and the computational cost from 8.4 to 5.3 giga floating-point operations (GFLOPs). The inference speed reaches 88.8 frames per second on an NVIDIA GeForce RTX 3090 GPU. These results indicate that RFE-YOLO achieves a favorable balance between detection accuracy and model efficiency under the current experimental setting.

Keywords: wind turbine blade defect detection; lightweight object detection; YOLOv10n; multi-scale feature fusion; attention mechanism

1. Introduction

Fossil energy has long occupied a dominant position in the global power supply system. However, with the increasing depletion of conventional energy resources and the growing severity of climate change, the transition of the energy structure toward renewable energy has become a major global development trend. As a clean and sustainable renewable energy source, wind energy plays an important role in expanding installed power capacity, reducing greenhouse gas emissions, and promoting the low-carbon transformation of the energy sector [1]. Wind turbines usually operate in complex environments, such as high-altitude, coastal, and mountainous areas, where they are continuously exposed to wind-blown sand, rain and snow, lightning strikes, and cyclic alternating loads. Under these conditions, typical blade surface defects, including Crack, Oil leakage, and Peel, are likely to occur. These defects not only degrade the aerodynamic performance of the blades and reduce wind energy conversion efficiency, but may also threaten structural integrity in severe cases, thereby adversely affecting the safe and stable operation of wind turbines. Therefore, research on efficient and accurate defect detection for wind turbine blades is of great engineering significance for ensuring operational reliability, improving power generation efficiency, and reducing maintenance costs [2].

At present, various methods have been proposed for wind turbine blade defect detection, including traditional inspection techniques such as vibration analysis, ultrasonic testing, and acoustic emission. However, these methods are often sensitive to environmental interference, rely on complex signal processing, or require extensive sensor deployment, which limits their practical application in complex inspection scenarios [3,4,5]. With the development of data-driven approaches, machine learning and deep learning have gradually become important research directions in this field. Early studies mainly focused on damage identification based on sensor signals. For example, Gaussian process regression was used to monitor changes in blade frequency for anomaly detection [6], logistic regression and support vector machines were employed to classify acoustic signals for damage identification [7], and multilayer perceptron models were adopted for defect detection [8]. Nevertheless, the robustness and generalization ability of these methods remain limited in complex environments. With the advancement of unmanned aerial vehicle inspection technology, vision-based deep learning methods have gradually emerged as an important technical route for wind turbine blade defect detection because of their noncontact operation, high efficiency, and intuitive detection results [9].

In addition to visual inspection methods, fault detection and isolation strategies have also been widely studied for wind turbine condition monitoring. For example, Borja-Jaimes et al. [10] proposed a sliding mode observer-based fault detection and isolation approach for a wind turbine benchmark, focusing on the diagnosis of pitch and drive train faults under model uncertainties and external disturbances. They further developed an actuator fault detection and isolation scheme using sliding mode observers for a wind turbine benchmark, providing an observer-based framework for identifying actuator-related faults [11]. These studies demonstrate the importance of intelligent fault diagnosis in wind energy systems from the perspective of system dynamics and control. However, observer-based methods mainly focus on internal component or actuator faults, whereas the present study addresses surface defect detection of wind turbine blades using images captured by unmanned aerial vehicles. Therefore, vision-based lightweight object detection can be regarded as a complementary route for blade surface condition monitoring.

Existing visual object detection methods can mainly be divided into two-stage and one-stage approaches. Representative two-stage methods include Faster Region-based Convolutional Neural Network (Faster R-CNN) [12] and Mask Region-based Convolutional Neural Network (Mask R-CNN) [13], which generally achieve relatively high detection accuracy. In related studies, Shihavuddin et al. introduced InceptionResNet-V2 into Faster R-CNN and combined it with a multiscale strategy, achieving an average detection accuracy of 81.10% [14]. Diaz et al. proposed the Cascade Mask R-DSCNN model, which achieved effective detection of multiple types of damage through multiscale feature fusion [15]. However, two-stage methods usually involve complex network architectures and slow inference speed, making them unsuitable for real-time scenarios such as unmanned aerial vehicle inspection. In contrast, one-stage methods, represented by Single Shot MultiBox Detector (SSD) [16] and the You Only Look Once (YOLO) series, offer a better balance between detection accuracy and real-time performance and are therefore more suitable for practical engineering applications.

Existing studies have shown that the performance of wind turbine blade defect detection can be improved to a certain extent by introducing lightweight backbone networks or enhancing feature fusion structures. For instance, Zhao et al. proposed the SN-CA-SSD model, which replaced VGG-16 with ShuffleNetv2 and introduced a coordinate attention mechanism to improve small-object detection capability [17]. Lv et al. improved feature representation by combining an enhanced ResNet with a Bidirectional Feature Pyramid Network (BiFPN) [18]. Cao et al. proposed a cross-layer feature fusion network and incorporated an Inception module, improving the detection accuracy for small targets by 2.39% compared with YOLOv3-Tiny [19]. Yan et al. optimized anchor boxes using K-means++ and combined them with BiFPN to improve YOLOv4, achieving an average crack detection accuracy of 93.49% [20]. Zhang et al. proposed a lightweight YOLOv5s model integrating MobileNetv3, the Convolutional Block Attention Module (CBAM), and the Extended Intersection over Union (EIOU) loss function, resulting in a 5.51% increase in mean Average Precision (mAP) [21]. Fu et al. proposed the LE-YOLO model, which achieved 78.7% mAP@0.5 in low-resolution image scenarios [22]. Liu et al. improved YOLOv8, increasing mAP@0.5 by 3.0% while reducing model complexity [23].

Although lightweight YOLO-based methods have improved the efficiency of wind turbine blade defect detection, several limitations remain unresolved. Many existing methods mainly focus on a single type of modification, such as replacing the backbone network, introducing an attention module, or improving the feature pyramid structure. Backbone replacement can reduce the number of parameters, but it may weaken the representation of fine-grained defect details. Attention modules can enhance feature responses, but they are often applied at a single stage and may not fully address multiscale feature interaction. Feature pyramid structures such as PAN or BiFPN can improve multiscale aggregation, but their compatibility with lightweight YOLO neck structures is not always specifically optimized for wind turbine blade defects.

In practical blade inspection scenarios, these limitations become more evident. First, small defects in the Crack category usually occupy only a limited number of pixels and often show weak texture contrast against the blade surface, which makes them difficult for lightweight detectors to represent accurately. Second, blade defects exhibit clear scale variation. Crack defects are usually thin and elongated, whereas Peel and Oil leakage defects may cover larger or more irregular regions. Therefore, the model needs to effectively integrate low-level edge details and high-level semantic information. Third, improving detection accuracy by adding attention or fusion modules may introduce extra computational cost, which is unfavorable for inspection using unmanned aerial vehicles under restricted onboard computing resources. Fourth, even after feature fusion, critical defect responses may still be weakened by complex backgrounds, illumination changes, and surface textures. Therefore, an effective lightweight detector should simultaneously strengthen fine-grained local feature extraction, improve compact cross-scale feature fusion, enhance defect-related feature refinement, and maintain low computational cost. These unresolved issues motivate the coordinated design of C2f-RFAConv, CCFM, and ELA in the proposed RFE-YOLO.

To address the above problems, this paper proposes a lightweight defect detection algorithm, termed RFE-YOLO, based on YOLOv10n. Unlike existing lightweight YOLO-based defect detection methods that mainly focus on a single architectural component, such as backbone replacement, attention enhancement, or feature pyramid modification, RFE-YOLO performs a coordinated redesign of YOLOv10n across three stages: feature extraction, feature fusion, and feature refinement. Specifically, a C2f-RFAConv module, which combines the C2f structure with Receptive Field Attention Convolution, is introduced into the backbone network to enhance receptive-field-aware local detail feature extraction for fine-grained blade defects, especially Crack defects. A Compact Cross-scale Feature Fusion Module, termed CCFM, is designed by adapting the convolutional cross-scale fusion concept of RT-DETR to the FPN and PAN style neck of YOLOv10n, thereby improving multiscale feature integration while reducing model complexity. In addition, an Efficient Local Attention module, termed ELA, is inserted before the detection head to strengthen defect-related spatial responses after feature fusion. Through these coordinated improvements, the proposed model enhances the detection performance of wind turbine blade defects while preserving the advantages of lightweight design.

The main contributions of this study are summarized as follows:

A lightweight RFE-YOLO framework is developed for wind turbine blade defect detection. The proposed framework performs coordinated structural optimization of YOLOv10n at the backbone, neck, and feature refinement stages before the detection head.
A C2f-RFAConv backbone module is introduced to replace part of the conventional convolutional operation in the C2f bottleneck. This design enables receptive-field-aware local feature modeling and improves the representation of fine-grained defects such as Crack defects.
A Compact Cross-scale Feature Fusion Module, termed CCFM, is designed by adapting the CCFF concept of RT-DETR to the FPN and PAN style neck of YOLOv10n. This modification improves the compatibility of cross-scale feature fusion with the lightweight YOLOv10n framework and enhances the integration of low-level defect edge information and high-level semantic features, which is beneficial for representing Crack, Oil leakage, and Peel defects with different scales and appearances.
An ELA-based feature refinement strategy is incorporated before the detection head to enhance defect-sensitive spatial responses before prediction. This strategy improves the detection of small and irregular blade defects with limited computational overhead.
Extensive ablation and comparative experiments demonstrate that the three components provide complementary improvements. The proposed RFE-YOLO achieves a better balance among detection accuracy, model size, computational complexity, and inference speed.

2. Method

2.1. The Architecture of YOLOv10n

This section introduces the YOLOv10n network architecture and provides the theoretical basis for the subsequent model improvements. YOLOv10n consists of four main components: the input layer, the backbone network, the neck network, and the detection head, as shown in Figure 1. The input layer is responsible for image resizing, normalization, and data adaptation. The backbone network extracts hierarchical visual features from the input image and generates feature maps containing both semantic information and spatial details. The neck network then performs multiscale feature fusion to strengthen the interaction between high-level semantic information and low-level spatial information. Finally, the detection head predicts the location and category of the targets based on the fused feature maps, completing bounding box regression and category prediction. In Figure 1, the backbone mainly contains Conv, C2f, Spatial-Channel Decoupled Downsampling (SCDown), Spatial Pyramid Pooling Fast (SPPF), and Partial Self-Attention (PSA) modules. The neck uses an FPN and PAN style feature fusion structure, and the head contains one-to-many and one-to-one detection branches.

YOLOv10n was selected as the baseline model in this study for the following reasons. First, as the nano-scale version of YOLOv10, YOLOv10n has a compact architecture with a relatively low parameter count and computational cost, which is consistent with the lightweight requirement of unmanned aerial vehicle-based wind turbine blade inspection. Second, YOLOv10 introduces a dual label assignment strategy that combines one-to-many and one-to-one assignments, reducing the dependence on postprocessing and providing a suitable basis for real-time object detection. Third, the backbone, neck, and head structure of YOLOv10n contains representative lightweight components such as C2f, SCDown, SPPF, and PSA, which makes it suitable for analyzing targeted improvements in feature extraction, feature fusion, and feature refinement. To account for recent advances in lightweight object detection, YOLOv11, YOLOv12, and YOLOv13 were included in the comparative experiments [24,25,26]. Therefore, YOLOv10n was used as the structural baseline for model improvement, while newer YOLO variants were used as external comparison methods to evaluate the competitiveness of the proposed RFE-YOLO.

YOLOv10n adopts a default input resolution of 640 × 640 and contains several lightweight modules [27]. SCDown supports efficient downsampling [28], SPPF reduces computational overhead while preserving spatial pyramid pooling capability [29], and PSA improves feature modeling efficiency through partial self-attention. In the neck network, YOLOv10n combines Feature Pyramid Network (FPN), Path Aggregation Network (PAN), and PANet style feature fusion to enhance multiscale representation [30,31,32]. In addition, its dual label assignment strategy improves supervision quality and supports efficient detection [33].

In the following sections, tensors are represented in the form $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively. The batch dimension is omitted for simplicity. The symbol $\odot$ denotes element-wise multiplication, $[\cdot, \cdot]$ denotes channel concatenation after spatial resolution alignment, and $\mathrm{Conv}_{k \times k}$, $\mathrm{GN}$, $\mathrm{BN}$, and $\sigma(\cdot)$ denote convolution, Group Normalization, Batch Normalization, and the Sigmoid activation function, respectively.

2.2. RFAConv

In traditional convolutional neural networks, convolution kernel parameters are shared across different spatial locations. This parameter sharing strategy improves computational efficiency, but it may limit the ability of the model to capture position-sensitive local patterns, especially when defect regions are small, weakly contrasted, or partially discontinuous. Although spatial attention mechanisms can alleviate this problem to some extent [34], they do not fully address the representational limitation caused by fixed convolutional aggregation within local receptive fields. To improve local feature modeling, Zhang et al. proposed Receptive Field Attention Convolution (RFAConv), which introduces an adaptive weighting strategy inside each receptive field [35]. The structure of RFAConv is shown in Figure 2, where the input feature map is first transformed into receptive-field spatial features and receptive-field attention weights, and the weighted features are then rearranged and aggregated by convolution.

Let $X \in \mathbb{R}^{C \times H \times W}$ denote the input feature map, where $C$, $H$, and $W$ represent the number of channels, height, and width, respectively. Let $k$ denote the convolution kernel size. In RFAConv, the receptive field attention operation can be formulated as:

$$F = A_{rf} \odot F_{rf} \tag{1}$$

where $A_{rf}$ denotes the receptive field attention map, $F_{rf}$ denotes the transformed receptive field spatial feature, and $\odot$ denotes element-wise multiplication. Specifically, these two terms are computed as

$$\begin{array}{l} A_{rf} = \mathrm{Softmax}(g^{1 \times 1}(\mathrm{AvgPool}(X))) \\ F_{rf} = \mathrm{ReLU}(\mathrm{Norm}(g^{k \times k}(X))) \end{array} \tag{2}$$

where $g^{i \times i}(\cdot)$ denotes a grouped convolution with kernel size $i \times i$, $\mathrm{AvgPool}(\cdot)$ aggregates information within each receptive field, $\mathrm{Norm}(\cdot)$ denotes the normalization operation, and $\mathrm{Softmax}(\cdot)$ is used to emphasize important features within each receptive field. After the attention weighting, the feature map is rearranged by an Adjust Shape operation. A $k \times k$ convolution with stride $k$ is then applied to aggregate the weighted receptive field features:

$$Y = \mathrm{Conv}_{k \times k,\, s = k}(\mathrm{AdjustShape}(F)) \tag{3}$$

where $Y$ denotes the output feature map of RFAConv.
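The three equations above can be traced on a toy input. The following is a minimal sketch, not the authors' implementation: it handles a single channel, omits the normalization step, lets average pooling count zero-padded positions, and uses hypothetical weights (`attn_w`, `feat_w`, `out_w`) chosen only for illustration. The function name `rfaconv_single_channel` is likewise an assumption.

```python
import math


def rfaconv_single_channel(x, k, attn_w, feat_w, out_w):
    """Toy single-channel RFAConv forward pass following Eqs. (1)-(3).

    x      : H x W nested list (one channel)
    attn_w : k*k scalars, stand-in for the 1x1 grouped convolution
    feat_w : k*k flat kernels (each of length k*k), stand-in for g^{k x k}
    out_w  : flat k*k kernel of the final stride-k convolution
    Returns (y, attn): y is H x W, attn[i][j] holds k*k attention weights.
    All weights here are hypothetical illustration values.
    """
    h, w = len(x), len(x[0])
    pad = k // 2
    # Zero-pad so every location has a full k x k receptive field.
    xp = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            xp[i + pad][j + pad] = float(x[i][j])

    y = [[0.0] * w for _ in range(h)]
    attn = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            patch = [xp[i + u][j + v] for u in range(k) for v in range(k)]
            # A_rf: softmax over the k*k slots of this receptive field.
            avg = sum(patch) / (k * k)
            logits = [a * avg for a in attn_w]
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            total = sum(exps)
            a_rf = [e / total for e in exps]
            # F_rf: ReLU of the k*k kernel responses (Norm omitted here).
            f_rf = [max(0.0, sum(c * p for c, p in zip(kern, patch)))
                    for kern in feat_w]
            # Eq. (1): element-wise weighting.
            f = [a * v for a, v in zip(a_rf, f_rf)]
            # Eq. (3): AdjustShape lays these k*k values out as one k x k
            # tile, and the stride-k convolution reads exactly that tile.
            y[i][j] = sum(o * v for o, v in zip(out_w, f))
            attn[i][j] = a_rf
    return y, attn


k = 3
# One-hot kernels make F_rf equal to the ReLU of the unfolded patch.
identity_kernels = [[1.0 if q == p else 0.0 for q in range(k * k)]
                    for p in range(k * k)]
x = [[1.0] * 3 for _ in range(3)]
y, attn = rfaconv_single_channel(x, k, [0.0] * 9, identity_kernels, [1.0] * 9)
print(y[1][1], y[0][0])
```

Because the final convolution has stride $k$ and each location contributes exactly one $k \times k$ tile after rearrangement, the output keeps the spatial size of the input, which is why the sketch can compute each output value directly from its own tile.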

RFAConv introduces a receptive field-oriented attention mechanism to reduce the limitations caused by fixed parameter sharing in standard convolution. In a standard convolution operation, the same kernel weights are applied to different spatial locations, and the features within each receptive field are aggregated without explicitly distinguishing defect-related responses from background responses. This mechanism is computationally efficient, but it may weaken small and low contrast defect features during local aggregation.

In RFAConv, the receptive field attention map $A_{rf}$ adaptively modulates the transformed receptive field spatial feature $F_{rf}$ through element-wise multiplication, as expressed by $F = A_{rf} \odot F_{rf}$. From this perspective, $A_{rf}$ adjusts the effective contribution of each feature within the receptive field before the final convolutional aggregation. Features related to defect edges and local texture variations can receive stronger responses, whereas redundant background information can be suppressed. After feature rearrangement, a $k \times k$ convolution with stride $k$ is applied to aggregate the weighted receptive field features efficiently.

This mechanism is particularly suitable for fine-grained wind turbine blade defects. Crack defects are usually thin, discontinuous, and weakly contrasted against the blade surface. If only fixed convolutional weights are used, these subtle defect signals may be diluted by surrounding blade textures. By introducing receptive-field adaptive weighting, RFAConv enhances the relative contribution of defect-sensitive local features and improves the backbone network's ability to preserve discriminative details with limited computational overhead.

In this paper, the second convolutional layer in the C2f bottleneck is replaced with RFAConv to construct the C2f-RFAConv module, which is embedded into the backbone network. By preserving the original multi-branch structure and residual connections, the module introduces attention-guided dynamic receptive-field modeling to enhance local feature extraction without substantially increasing computational overhead. As a result, the backbone produces more discriminative representations, improving the perception of small-scale and multi-scale defects in wind turbine blade images. The structure of C2f-RFAConv is shown in Figure 3.

2.3. CCFM

In recent years, Transformer-based object detection methods have improved detection accuracy and convergence speed by introducing multiscale feature encoders. However, multiscale feature modeling substantially increases sequence length and computational complexity. Even with efficient mechanisms such as deformable attention, the encoder remains a major computational bottleneck. To alleviate this issue, RT-DETR [36] decouples intra-scale interaction from cross-scale fusion. In its efficient hybrid encoder, Attention-based Intra-scale Feature Interaction (AIFI) is used to process high-level semantic features, while CNN-based Cross-scale Feature Fusion (CCFF) is used to fuse multiscale convolutional features with reduced computational cost. Figure 4 illustrates the original CCFF structure, which performs feature fusion through channel alignment, convolutional fusion, and residual connections.

Let $S_{3}$, $S_{4}$, and $S_{5}$ denote the feature maps generated by the last three stages of the backbone, where $S_{3}$ has the highest spatial resolution and $S_{5}$ has the strongest semantic representation. The proposed CCFM first performs top-down feature fusion and then conducts bottom-up feature aggregation. For clarity, $C(\cdot)$, $U(\cdot)$, $D(\cdot)$, $B(\cdot)$, and $H(\cdot)$ denote convolution, upsampling, SCDown, C2f, and C2fCIB operations, respectively. The symbol $[\cdot, \cdot]$ denotes channel concatenation after spatial resolution alignment. The top-down fusion process is formulated as:

$$\begin{array}{l} P_{5} = C_{5}(S_{5}) \\ P_{4} = B_{4}([C_{4}(S_{4}), U(P_{5})]) \\ P_{3} = B_{3}([C_{3}(S_{3}), U(P_{4})]) \end{array} \tag{4}$$

where $P_{5}$, $P_{4}$, and $P_{3}$ represent the progressively fused multiscale features. The upsampling operation $U(\cdot)$ is used to align the spatial resolution of high-level features with that of the corresponding low-level features before concatenation.

The bottom-up aggregation process is expressed as:

$$\begin{array}{l} N_{4} = B_{4'}([C_{d}(P_{3}), P_{4}]) \\ N_{5} = H([D(N_{4}), P_{5}]) \end{array} \tag{5}$$

where $C_{d}(\cdot)$ denotes a downsampling convolution, $D(\cdot)$ denotes the SCDown operation, and $H(\cdot)$ denotes the C2fCIB-based fusion operation. Finally, the output of CCFM is given by:

$$O = \{P_{3}, N_{4}, N_{5}\} \tag{6}$$

Here, $P_{3}$, $N_{4}$, and $N_{5}$ are used as the multiscale output features for the subsequent detection head.
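The data flow of Eqs. (4)-(6) can be checked by tracing tensor shapes only. The sketch below is illustrative rather than the authors' code: each operator is reduced to its effect on a `(C, H, W)` tuple, the channel counts and the uniform fusion width of 64 are hypothetical, and the input sizes assume a 640 × 640 image with the usual backbone strides of 8, 16, and 32.

```python
def conv(x, out_c, stride=1):
    """Shape effect of a convolution: set channels, divide resolution."""
    _, h, w = x
    return (out_c, h // stride, w // stride)


def block(x, out_c):
    """Shape effect of a C2f or C2fCIB block: resolution is preserved."""
    _, h, w = x
    return (out_c, h, w)


def upsample(x, scale=2):
    """Shape effect of nearest-neighbor upsampling."""
    c, h, w = x
    return (c, h * scale, w * scale)


def concat(a, b):
    """Channel concatenation; spatial sizes must already be aligned."""
    assert a[1:] == b[1:], "spatial sizes must match before concatenation"
    return (a[0] + b[0], a[1], a[2])


def ccfm_shapes(s3, s4, s5, width=64):
    """Trace (C, H, W) shapes through the top-down and bottom-up paths.

    `width` is a hypothetical common channel count for the fused features.
    """
    # Top-down path, Eq. (4).
    p5 = conv(s5, width)                                        # C_5
    p4 = block(concat(conv(s4, width), upsample(p5)), width)    # B_4
    p3 = block(concat(conv(s3, width), upsample(p4)), width)    # B_3
    # Bottom-up path, Eq. (5).
    n4 = block(concat(conv(p3, width, stride=2), p4), width)    # C_d, B_4'
    n5 = block(concat(conv(n4, width, stride=2), p5), width)    # D, H
    # Eq. (6): the three outputs passed on to the detection head.
    return p3, n4, n5


# Hypothetical backbone outputs for a 640 x 640 input (strides 8, 16, 32).
s3, s4, s5 = (64, 80, 80), (128, 40, 40), (256, 20, 20)
for name, shape in zip(("P3", "N4", "N5"), ccfm_shapes(s3, s4, s5)):
    print(name, shape)
```

The assertion inside `concat` makes the alignment requirement explicit: every concatenation in Eqs. (4) and (5) is preceded by an upsampling or downsampling step that brings both operands to the same resolution.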

The original CCFF module in RT-DETR is designed for the efficient hybrid encoder of a Transformer-based detector. However, the feature flow of RT-DETR is different from the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) style neck structure of YOLOv10n. Therefore, directly applying the original CCFF structure to YOLOv10n may not fully match the feature propagation pattern of the YOLO neck.

To address this issue, this study redesigns the cross-scale fusion strategy inspired by CCFF according to the neck structure of YOLOv10n and proposes CCFM. As shown in Figure 5, CCFM takes three multiscale feature maps as inputs and follows a bidirectional feature fusion process. In the top-down path, high-level semantic features are progressively upsampled and concatenated with lower-level features to enhance fine-grained spatial details. The concatenated features are refined using lightweight C2f modules. In the bottom-up path, the fused high-resolution features are downsampled and aggregated with deeper features to strengthen semantic consistency. Conv, C2f, SCDown, and C2fCIB are used as the main lightweight fusion components. Specifically, Conv is used for channel adjustment and local feature transformation, C2f is used for feature refinement after concatenation, SCDown is used for efficient downsampling, and C2fCIB is used to enhance high-level feature aggregation before prediction. The final outputs of CCFM are three multiscale feature maps, which are sent to the subsequent detection head.

Compared with the original CNN-based Cross-scale Feature Fusion module in RT-DETR, the proposed Compact Cross-scale Feature Fusion Module introduces three structural changes to adapt the fusion process to the FPN and PAN style neck of YOLOv10n. First, the cross-scale fusion path is reorganized into a bidirectional top-down and bottom-up structure, so that high-level semantic features and low-level spatial details can be fused more effectively. Second, Conv, C2f, SCDown, and C2fCIB are used as lightweight fusion components instead of directly adopting the original CCFF block, which improves compatibility with the YOLOv10n neck and reduces redundant computation. Third, considering the scale variation and irregular appearance of wind turbine blade defects, CCFM strengthens the integration of low-level edge information and high-level semantic representations. This design is beneficial for representing Crack, Oil leakage, and Peel defects with different shapes, textures, and spatial scales. Therefore, CCFM inherits the efficient cross-scale fusion idea of CCFF while improving its adaptability to the lightweight YOLOv10n framework and the specific defect characteristics of wind turbine blade images.

2.4. ELA

After backbone feature extraction and neck-based multiscale fusion, the model obtains feature maps containing both semantic and spatial information. However, under lightweight design constraints, these fused features may still show weak responses to critical defect regions and limited sensitivity to fine-grained defects, which can degrade detection accuracy. This issue is particularly evident in wind turbine blade defect detection, where Crack defects are usually small and fine-grained, while Oil leakage and Peel defects may show irregular shapes and variable spatial distributions.

To address this problem, an Efficient Local Attention (ELA) module is inserted before the detection head to refine the fused features [37]. By modeling local spatial dependencies, ELA enhances responses to defect-related regions while suppressing redundant background information, thereby improving the detection of fine-grained and irregular defects. This refinement is achieved with limited computational overhead, making ELA suitable for lightweight detection frameworks. As shown in Figure 6, ELA generates attention weights along the horizontal and vertical directions using one-dimensional pooling, one-dimensional convolution, Group Normalization, and Sigmoid activation. These direction-aware attention weights are then applied to the input feature map to enhance spatially important regions.

ELA enhances critical region localization by modeling local spatial dependencies without introducing channel-dimension reduction. Unlike conventional attention mechanisms, it employs one-dimensional convolution and group normalization to refine features with low computational overhead. Let $X \in \mathbb{R}^{C \times H \times W}$ denote the input feature map of ELA. The horizontal and vertical direction descriptors are obtained by one-dimensional global average pooling:

$$\begin{array}{l} z_{c}^{h}(h) = \frac{1}{W} \sum_{w = 1}^{W} X_{c}(h, w), \quad h = 1, 2, \dots, H \\ z_{c}^{w}(w) = \frac{1}{H} \sum_{h = 1}^{H} X_{c}(h, w), \quad w = 1, 2, \dots, W \end{array} \tag{7}$$

where $z^{h} \in \mathbb{R}^{C \times H \times 1}$ and $z^{w} \in \mathbb{R}^{C \times 1 \times W}$ encode spatial dependencies along the horizontal and vertical directions, respectively. Then, one-dimensional convolution, Group Normalization, and Sigmoid activation are applied to generate direction-aware attention weights:

$$y^{h} = \sigma(G_{n}(F_{h}(z^{h}))) \tag{8}$$

$$y^{w} = \sigma(G_{n}(F_{w}(z^{w}))) \tag{9}$$

where $F_{h}(\cdot)$ and $F_{w}(\cdot)$ denote one-dimensional convolutions along the height and width directions, respectively, $G_{n}(\cdot)$ denotes Group Normalization, and $\sigma(\cdot)$ denotes the Sigmoid activation function. Finally, the enhanced feature map is computed as:

$$Y_{c}(h, w) = X_{c}(h, w) \cdot y_{c}^{h}(h) \cdot y_{c}^{w}(w) \tag{10}$$

or equivalently,

$$Y = X \odot y^{h} \odot y^{w} \tag{11}$$

where the attention weights $y^{h}$ and $y^{w}$ are broadcast to the same spatial size as $X$, and $\odot$ denotes element-wise multiplication.
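Equations (7)-(11) can be followed step by step on a small feature map. The sketch below is a simplified illustration under stated assumptions, not the authors' module: the one-dimensional kernels are hypothetical and shared across channels, zero padding keeps the sequence length, and the normalization is applied per channel without learned scale or shift (equivalent to Group Normalization with one group per channel).

```python
import math


def conv1d_same(seq, kernel):
    """1D convolution with zero padding; expects an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kw * padded[i + t] for t, kw in enumerate(kernel))
            for i in range(len(seq))]


def normalize(seq, eps=1e-5):
    """Per-channel normalization (GroupNorm with one group per channel,
    no learned affine parameters; a simplifying assumption)."""
    mean = sum(seq) / len(seq)
    var = sum((v - mean) ** 2 for v in seq) / len(seq)
    return [(v - mean) / math.sqrt(var + eps) for v in seq]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def ela(x, kernel_h, kernel_w):
    """Toy ELA forward pass on a C x H x W nested list."""
    out = []
    for xc in x:
        h, w = len(xc), len(xc[0])
        # Eq. (7): average over width gives z^h, over height gives z^w.
        z_h = [sum(row) / w for row in xc]
        z_w = [sum(xc[i][j] for i in range(h)) / h for j in range(w)]
        # Eqs. (8)-(9): 1D conv, normalization, sigmoid.
        y_h = [sigmoid(v) for v in normalize(conv1d_same(z_h, kernel_h))]
        y_w = [sigmoid(v) for v in normalize(conv1d_same(z_w, kernel_w))]
        # Eqs. (10)-(11): broadcast both weight vectors over the map.
        out.append([[xc[i][j] * y_h[i] * y_w[j] for j in range(w)]
                    for i in range(h)])
    return out


x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]              # C = 1, H = 2, W = 3
kernel = [0.25, 0.5, 0.25]           # hypothetical smoothing kernel
y = ela(x, kernel, kernel)
print(y[0])
```

Since both attention vectors lie strictly between 0 and 1, the module can only rescale each activation downward by a position-dependent factor; rows and columns with stronger pooled responses are attenuated less than the rest.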

In the proposed model, ELA is inserted before the detection head to further refine the multiscale features fused by CCFM. By introducing lightweight attention modulation before prediction, ELA strengthens responses to defect-related regions with only marginal computational overhead, thereby compensating for the limited feature representation capacity of lightweight architectures. Together with C2f-RFAConv and CCFM, ELA forms a coordinated framework for feature extraction, feature fusion, and feature refinement, which improves wind turbine blade defect detection while keeping the model lightweight.

2.5. RFE-YOLO Network Structure

In summary, to address the limitations of lightweight models in fine-grained defect representation, multiscale feature fusion, and defect-related region response for wind turbine blade defect detection, this study proposes targeted improvements to the YOLOv10n network structure. The C2f-RFAConv module is introduced into the backbone network to enhance receptive-field-aware local feature extraction, which helps improve the representation of fine-grained defects such as Crack. In the neck network, the CCFM lightweight feature fusion structure is adopted to achieve efficient cross-scale feature integration with low computational overhead, thereby improving the utilization of features at different scales for Crack, Oil leakage, and Peel defects. In addition, the ELA mechanism is inserted before the detection head to further refine the fused features and strengthen the model response to defect-related regions.

The integration of C2f-RFAConv, CCFM, and ELA follows the hierarchical feature processing mechanism of YOLOv10n. C2f-RFAConv focuses on backbone feature extraction, CCFM focuses on neck feature fusion, and ELA focuses on feature refinement before prediction. Therefore, the three modules correspond to different stages of the detection pipeline rather than repeatedly enhancing the same feature representation. Through this stage-specific design, RFE-YOLO achieves complementary optimization in feature extraction, feature fusion, and feature refinement while maintaining low parameter count and computational complexity.

The overall structure of RFE-YOLO is shown in Figure 7. Compared with the original YOLOv10n structure shown in Figure 1, the proposed RFE-YOLO mainly differs in three parts. First, several original C2f modules in the backbone are replaced by C2f-RFAConv modules to enhance receptive-field-aware local detail feature extraction. Second, the original neck structure is redesigned using CCFM, which performs multiscale feature fusion through top-down and bottom-up feature interaction with Conv, C2f, SCDown, and C2fCIB components. Third, ELA is inserted before the detection head to refine the fused features and enhance defect-related spatial responses before prediction. The detection head then outputs the predicted defect categories and bounding boxes. Therefore, Figure 7 highlights the modified backbone, redesigned neck, and added feature refinement module of RFE-YOLO, rather than simply repeating the original YOLOv10n architecture.

3. Experiments

3.1. Datasets

The dataset used in this study was obtained from an open-source wind turbine blade defect dataset hosted on Roboflow Universe. Before data augmentation, the dataset contained 351 original images, including 124 Crack images, 113 Peel images, and 114 Oil leakage images. The defect categories considered in this study were Crack, Oil leakage, and Peel. All images were annotated in YOLO format, and the bounding boxes were checked to ensure that the annotated regions covered the corresponding defect areas as accurately as possible.

To improve the diversity of training samples and enhance the robustness of the model, data augmentation was applied to the dataset. The augmentation operations included geometric transformation, image scaling, flipping, Mosaic augmentation, and color perturbation. After augmentation, the dataset contained 2089 images. To avoid data leakage, the dataset was split at the original image level. Images generated from the same original image were assigned to the same subset, and augmented versions of one image were not allowed to appear across the training, validation, and test sets. This strategy ensured that the evaluation results were not affected by repeated augmented samples derived from the same original image.
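One way to realize the image-level split described above is to group augmented files by their source image and assign whole groups to a subset. The sketch below is illustrative only: the function `split_by_source`, the file naming, and the default fractions of 0.7/0.2/0.1 applied to source images are assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_source(samples, fractions=(0.7, 0.2, 0.1), seed=42):
    """Split augmented images so that each source image stays in one subset.

    samples   : list of (augmented_name, source_id) pairs
    fractions : train/val/test shares, applied to source images (assumed)
    """
    groups = defaultdict(list)
    for name, source_id in samples:
        groups[source_id].append(name)

    source_ids = sorted(groups)
    random.Random(seed).shuffle(source_ids)

    n_train = round(fractions[0] * len(source_ids))
    n_val = round(fractions[1] * len(source_ids))
    parts = {
        "train": source_ids[:n_train],
        "val": source_ids[n_train:n_train + n_val],
        "test": source_ids[n_train + n_val:],
    }
    # Expand each subset from source ids back to augmented file names.
    return {k: [n for sid in ids for n in groups[sid]] for k, ids in parts.items()}


# Toy data: 10 hypothetical source images with 3 augmented variants each.
samples = [(f"img{i:02d}_aug{j}.jpg", i) for i in range(10) for j in range(3)]
subsets = split_by_source(samples)
print({k: len(v) for k, v in subsets.items()})
```

Because the shuffle and the cut points operate on source ids rather than on file names, no augmented variant can end up in a different subset from its siblings.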

The augmented dataset was divided into training, validation, and test sets according to a ratio of 7:2:1. The class distribution of defect instances in each subset was as follows. The training set contained 513 Oil leakage instances, 835 Peel instances, and 697 Crack instances. The validation set contained 120 Oil leakage instances, 246 Peel instances, and 198 Crack instances. The test set contained 79 Oil leakage instances, 119 Peel instances, and 95 Crack instances. Therefore, the dataset contained 712 Oil leakage instances, 1200 Peel instances, and 990 Crack instances in total.
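The reported instance counts can be cross-checked with a short tally. The numbers are copied from the paragraph above; the variable names are illustrative.

```python
# Instance counts per subset as reported in the text.
counts = {
    "train": {"Oil leakage": 513, "Peel": 835, "Crack": 697},
    "val":   {"Oil leakage": 120, "Peel": 246, "Crack": 198},
    "test":  {"Oil leakage": 79,  "Peel": 119, "Crack": 95},
}

classes = ["Oil leakage", "Peel", "Crack"]
class_totals = {c: sum(counts[s][c] for s in counts) for c in classes}
subset_totals = {s: sum(counts[s].values()) for s in counts}
grand_total = sum(subset_totals.values())

print(class_totals)
for subset, n in subset_totals.items():
    print(f"{subset}: {n} instances ({100 * n / grand_total:.1f}% of all instances)")
```

The per-class totals match the stated 712, 1200, and 990 instances. At the instance level, the three subsets hold roughly 70%, 19%, and 10% of the 2902 annotated defects, which is close to the 7:2:1 image-level ratio.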

In addition, annotation quality control was performed before model training. The bounding box annotations were visually inspected, and samples with ambiguous defect boundaries, incomplete boxes, or incorrect category labels were corrected. This process helped ensure the reliability of the dataset and the validity of the subsequent experimental evaluation.

3.2. Experimental Environment and Evaluation Metrics

All experiments were implemented using PyTorch 1.13.0 with CUDA 11.7. The training and inference experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU. To improve reproducibility, the random seed was fixed to 42 for dataset splitting, model initialization, and data loading in the main experiments. In the main comparison and ablation experiments, each model was trained once using the fixed random seed. Unless otherwise specified, all baseline models, comparison models, ablation variants, and the proposed RFE-YOLO were trained using the same dataset split, input resolution, optimizer, batch size, training epochs, and learning rate settings.

To further evaluate the robustness of the results, additional repeated experiments were conducted using random seeds 5 and 15 for selected representative models. The dataset split was kept unchanged, while the random seeds were varied for model initialization and data loading. Together with the main experimental result obtained using seed 42, these runs were used to calculate the mean and standard deviation reported in the robustness analysis.

No pretrained weights were used in this study. All models were trained from scratch on the wind turbine blade defect dataset using the same training protocol. This setting was adopted to ensure that the comparison reflected the effect of the network structures under identical training conditions.

To reduce the potential bias caused by repeated hyperparameter adjustment on the validation set, preliminary experiments were conducted only to determine a unified training configuration. After the configuration was fixed, it was applied consistently to all models. The test set was not used for hyperparameter selection or model selection and was reserved only for the final performance evaluation. The input image size was set to 640 × 640, and the batch size was set to 16. All models were trained for 300 epochs using the stochastic gradient descent optimizer. The initial learning rate was set to 0.01, and the final learning rate factor was set to 0.01. The learning rate was updated according to the default YOLOv10 training schedule. The momentum and weight decay were set to 0.937 and 0.0005, respectively. No early stopping strategy was used. The checkpoint with the best validation mAP@0.5 was selected for final testing. The detailed training hyperparameters are listed in Table 1.
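The configuration above can be collected in one place, together with a sketch of how the final learning rate factor acts on the initial learning rate. The text does not give the form of the default schedule, so the linear decay from `lr0` to `lr0 * lrf` used here is an assumption, warm-up is ignored, and the dictionary keys are illustrative names.

```python
# Hyperparameters as listed in the text; the dictionary keys are illustrative names.
train_cfg = {
    "imgsz": 640,
    "batch": 16,
    "epochs": 300,
    "optimizer": "SGD",
    "lr0": 0.01,          # initial learning rate
    "lrf": 0.01,          # final learning rate factor
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "seed": 42,
}


def linear_lr(epoch, cfg=train_cfg):
    """Learning rate at a given epoch under linear decay (assumed schedule, no warm-up)."""
    progress = min(epoch / cfg["epochs"], 1.0)
    factor = (1.0 - progress) * (1.0 - cfg["lrf"]) + cfg["lrf"]
    return cfg["lr0"] * factor


for epoch in (0, 150, 300):
    print(epoch, f"{linear_lr(epoch):.6f}")
```

Under this assumption the learning rate starts at 0.01 and ends at 0.01 × 0.01 = 0.0001 after 300 epochs.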

The same data augmentation strategy was applied to all models during training to ensure a fair comparison. The augmentation operations included flipping, scale transformation, horizontal flipping, vertical flipping, and Mosaic augmentation. These operations were used to increase the diversity of training samples and improve the robustness of the model to changes in object scale, position, and image appearance. The validation and test sets were not augmented during evaluation. The detailed augmentation hyperparameters are shown in Table 2.

During inference, all models were evaluated using the same input resolution and the same hardware platform. FPS was measured on the NVIDIA GeForce RTX 3090 GPU under identical inference conditions, excluding data loading time. Therefore, the reported detection accuracy, parameter count, computational cost, and inference speed were obtained under a consistent experimental protocol.
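A timing loop that excludes data loading might look like the following sketch. The number of warm-up runs, the helper name `measure_fps`, and the dummy workload are assumptions made so that the example runs without a model or GPU; the paper does not describe its timing code.

```python
import time


def measure_fps(infer, inputs, warmup=10, sync=None):
    """Time only the forward calls on pre-loaded inputs and return frames per second.

    infer  : callable that runs the model on one pre-loaded input
    inputs : inputs already in memory, so data loading is not timed
    sync   : optional callable that waits for the device to finish
             (for example torch.cuda.synchronize on a GPU)
    """
    for item in inputs[:warmup]:      # warm-up runs are not timed
        infer(item)
    if sync:
        sync()

    start = time.perf_counter()
    for item in inputs:
        infer(item)
    if sync:
        sync()
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


# Stand-in workload so the sketch runs without a model or GPU.
dummy_inputs = [list(range(1000)) for _ in range(200)]
fps = measure_fps(lambda x: sum(v * v for v in x), dummy_inputs)
print(f"{fps:.1f} FPS on the dummy workload")
```

On a GPU, the synchronization call matters because kernels are launched asynchronously; without it the timer may stop before the device has finished the work.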

Model performance was evaluated using Precision (P), Recall (R), Average Precision (AP), mean Average Precision (mAP), and the number of parameters. These metrics were used to quantify detection accuracy, missed detection tendency, class-level detection performance, and overall detection performance in the ablation and comparison experiments. Precision and Recall were calculated based on true positives (TP), false positives (FP), and false negatives (FN). Precision measures the proportion of correctly predicted positive defects among all predicted positive defects, whereas Recall reflects the proportion of correctly detected defects among all actual defects. AP is defined as the area under the precision-recall curve for each category, and mAP represents the mean AP over all categories. In this study, mAP@0.5 was adopted, where the Intersection over Union (IoU) threshold was set to 0.5 when determining true positive predictions. The evaluation formulas used for the quantitative results in Table 3 and Table 4 are given as follows:

P = \frac{TP}{TP + FP}

(12)

R = \frac{TP}{TP + FN}

(13)

mAP = \frac{\sum_{n = 1}^{N} AP_{n}}{N}

(14)

where N denotes the number of defect categories.
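The metric definitions can be written out in a few lines. This is a minimal sketch: the boxes and the TP/FP/FN counts are made-up values used only to exercise the formulas, while the per-class AP values are those reported for RFE-YOLO. The function names are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def mean_ap(ap_per_class):
    return sum(ap_per_class.values()) / len(ap_per_class)


# At mAP@0.5, a prediction can count as a true positive only if its IoU
# with a same-class ground-truth box is at least 0.5.
pred, truth = (10, 10, 50, 50), (20, 20, 60, 60)   # made-up boxes
print(f"IoU = {iou(pred, truth):.3f}")

# Hypothetical counts, only to exercise the Precision and Recall formulas.
print(precision(tp=80, fp=20), recall(tp=80, fn=10))

# Per-class AP values reported for RFE-YOLO.
ap = {"Crack": 86.6, "Oil leakage": 95.2, "Peel": 87.9}
print(f"mAP@0.5 = {mean_ap(ap):.1f}%")
```

Averaging the three reported AP values reproduces the reported mAP@0.5 of 89.9%, which confirms that mAP here is the unweighted mean over the N = 3 categories.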

4. Analysis of Experimental Results

4.1. Training Result Analysis Plot

Figure 8a presents the normalized confusion matrix of RFE-YOLO, while Figure 8b shows the precision-recall curves and AP values of the three defect categories. As shown in Figure 8b, the AP values for Crack, Oil leakage, and Peel are 86.6%, 95.2%, and 87.9%, respectively, and the overall mAP@0.5 reaches 89.9%. These results indicate that the proposed model achieves stable detection performance across different defect categories. In the confusion matrix of Figure 8a, each column represents the predicted category, and each row corresponds to the actual category. The normalized values on the diagonal represent correctly classified samples, while the off-diagonal values indicate misclassified samples. It can be observed that most defect instances are concentrated on the diagonal, suggesting that the proposed model can distinguish Crack, Oil leakage, and Peel defects with reliable classification performance.

4.2. Ablation Experiments

To further examine the individual and combined effects of the proposed modules, an ablation matrix is presented in Table 3. Compared with the baseline YOLOv10n, introducing C2f-RFAConv alone improves mAP@0.5 from 87.1% to 88.9%. The AP for Crack increases from 83.3% to 88.1%, indicating that receptive-field-aware feature extraction is beneficial for fine-grained defect representation. However, this module slightly increases the number of parameters and GFLOPs, and the FPS decreases from 56.2 to 54.1. This result suggests that C2f-RFAConv mainly contributes to accuracy improvement through enhanced local feature extraction.

When CCFM is used independently, the model achieves 86.5% mAP@0.5, with 1.98M parameters, 6.4 GFLOPs, and 75.6 FPS. Although the overall mAP is lower than that of the baseline, the model complexity is clearly reduced and the inference speed is improved. This indicates that CCFM is effective in reducing computational redundancy, but using it alone may weaken the representation of certain defect categories, especially Peel. When ELA is introduced independently, the model obtains 87.8% mAP@0.5 and improves the AP for Crack to 91.6%, which shows that local attention refinement can enhance spatial responses to fine defects. However, the AP for Peel decreases, suggesting that ELA alone cannot fully compensate for insufficient multiscale feature fusion.

The two-module combinations further reveal the complementary relationship among the proposed components. The combination of C2f-RFAConv and CCFM achieves 89.2% mAP@0.5, which is higher than the result obtained by using either module independently. This demonstrates that enhanced local feature extraction and compact cross-scale feature fusion can jointly improve detection performance. The combination of C2f-RFAConv and ELA achieves 88.5% mAP@0.5, while the combination of CCFM and ELA reaches 88.2% mAP@0.5 with only 1.96M parameters and 5.9 GFLOPs. These results indicate that the three modules act on different stages of the detection pipeline and provide complementary contributions to accuracy and efficiency.

When C2f-RFAConv, CCFM, and ELA are integrated, the complete RFE-YOLO model achieves the best mAP@0.5 of 89.9%. The AP values for Crack, Oil leakage, and Peel are 86.6%, 95.2%, and 87.9%, respectively. To further clarify the complexity effect of each module, the changes in parameters, GFLOPs, and FPS were analyzed relative to YOLOv10n. C2f-RFAConv alone slightly increases the parameters from 2.70M to 2.73M and reduces FPS from 56.2 to 54.1, indicating a small computational cost for enhanced local feature extraction. In contrast, CCFM alone reduces the parameters to 1.98M and GFLOPs to 6.4, while increasing FPS to 75.6, showing its contribution to lightweight feature fusion and computational redundancy reduction. ELA alone achieves 2.61M parameters, 8.0 GFLOPs, and 60.4 FPS, suggesting that it introduces limited computational overhead while improving feature refinement. When the three modules are combined, RFE-YOLO achieves 1.91M parameters, 5.3 GFLOPs, and 88.8 FPS. Compared with YOLOv10n, the complete model reduces the parameters by 0.79M, decreases GFLOPs by 3.1, and improves FPS by 32.6. Therefore, the performance improvement of RFE-YOLO is not mainly caused by increased architectural complexity, but is more likely attributed to the complementary roles of the three modules in feature extraction, feature fusion, and feature refinement.
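The complexity changes quoted above follow directly from the baseline and full-model rows of Table 3. The short calculation below recomputes the absolute differences and adds the corresponding relative changes, which the text does not state; variable names are illustrative.

```python
# Complexity figures for the baseline and the full model, as reported in Table 3.
baseline = {"params_M": 2.70, "gflops": 8.4, "fps": 56.2}   # YOLOv10n
full     = {"params_M": 1.91, "gflops": 5.3, "fps": 88.8}   # RFE-YOLO

for key in baseline:
    delta = full[key] - baseline[key]
    rel = 100 * delta / baseline[key]
    print(f"{key}: {delta:+.2f} ({rel:+.1f}%)")
```

In relative terms, the full model has about 29% fewer parameters and 37% fewer GFLOPs than YOLOv10n, with about 58% higher FPS on the same hardware.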

4.3. Comparison Experiments

To evaluate the effectiveness of the proposed RFE-YOLO, several representative object detection models were selected for comparison, including SSD, Faster R-CNN, YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv9-tiny, YOLOv10n, YOLOv11n, YOLOv12n, and YOLOv13n. The recent YOLO variants, including YOLOv11n [35], YOLOv12n [36], and YOLOv13n [37], were included to evaluate whether RFE-YOLO remains competitive against newer YOLO-based architectures. All comparison models followed the unified training and evaluation protocol described in Section 3.2, including the same dataset split, input resolution, optimizer, batch size, training epochs, and inference hardware. The parameters, GFLOPs, and FPS values were measured under the same experimental environment.

To provide a more comprehensive comparison with baseline models, mAP@0.5:0.95 and per-class AP values for Crack, Oil leakage, and Peel were included in the evaluation results, as shown in Table 4. Compared with YOLOv10n, the proposed RFE-YOLO improves mAP@0.5 from 87.1% to 89.9% and mAP@0.5:0.95 from 62.71% to 64.73%. Meanwhile, the number of parameters is reduced from 2.70M to 1.91M, GFLOPs decrease from 8.4 to 5.3, and the inference speed increases from 56.2 FPS to 88.8 FPS. These results indicate that RFE-YOLO improves detection accuracy while maintaining a compact model structure and high inference efficiency.

It can also be observed that RFE-YOLO does not achieve the best value for every single metric. For example, YOLOv5n obtains the highest Precision of 92.5%, while RFE-YOLO achieves 83.9%. However, YOLOv5n has lower Recall, mAP@0.5, and FPS than RFE-YOLO. This suggests that higher Precision alone does not necessarily indicate better overall detection performance, especially for practical inspection tasks where missed detections and inference efficiency should also be considered. YOLOv13n achieves a comparable mAP@0.5 of 87.7% and mAP@0.5:0.95 of 63.14%, but its FPS is only 39.6, which is much lower than that of RFE-YOLO. Therefore, the advantage of RFE-YOLO lies not only in achieving the highest mAP@0.5, but also in providing a better trade-off among detection accuracy, model complexity, and real-time inference speed.
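One way to make the trade-off argument concrete is a simple dominance check over the values in Table 4. The sketch below is an illustration added here, not an analysis from the paper. It considers only four axes (mAP@0.5, parameters, GFLOPs, FPS) and treats a model as dominated if another model is at least as good on all four and strictly better on at least one.

```python
# (mAP@0.5 %, Params M, GFLOPs, FPS) for each model, copied from Table 4.
models = {
    "SSD":          (62.4, 35.64, 34.86, 13.1),
    "Faster R-CNN": (76.8, 41.8, 142.4, 6.8),
    "YOLOv5n":      (87.8, 1.9, 5.1, 82.1),
    "YOLOv7-tiny":  (67.3, 6.02, 13.02, 19.8),
    "YOLOv8n":      (85.1, 3.01, 8.2, 76.8),
    "YOLOv9-tiny":  (85.0, 2.84, 12.1, 22.2),
    "YOLOv10n":     (87.1, 2.7, 8.4, 56.2),
    "YOLOv11n":     (76.5, 2.59, 6.4, 41.3),
    "YOLOv12n":     (81.2, 2.5, 5.8, 35.4),
    "YOLOv13n":     (87.7, 2.46, 6.4, 39.6),
    "RFE-YOLO":     (89.9, 1.91, 5.3, 88.8),
}


def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one.

    Higher is better for mAP and FPS; lower is better for Params and GFLOPs.
    """
    a_key = (a[0], -a[1], -a[2], a[3])
    b_key = (b[0], -b[1], -b[2], b[3])
    return all(x >= y for x, y in zip(a_key, b_key)) and a_key != b_key


pareto = [
    name for name, m in models.items()
    if not any(dominates(other, m) for other in models.values())
]
print(pareto)
```

On these four axes only YOLOv5n and RFE-YOLO remain non-dominated: YOLOv5n is marginally smaller in parameters and GFLOPs, while RFE-YOLO has higher mAP@0.5 and FPS. Precision and Recall are deliberately left out of this check.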

For per-class performance, RFE-YOLO achieves AP values of 86.6%, 95.2%, and 87.9% for Crack, Oil leakage, and Peel, respectively. Compared with YOLOv10n, the AP for Crack increases from 83.3% to 86.6%, and the AP for Peel increases from 82.1% to 87.9%. Although the AP for Oil leakage is slightly lower than that of YOLOv10n, it remains at a high level of 95.2%. These results show that the proposed model improves the detection of challenging defect categories while preserving strong performance for relatively easier categories.

To further examine the stability of the proposed method, repeated experiments were conducted using three random seeds, namely 42, 5, and 15. The dataset split was kept unchanged, while the random seeds were varied for model initialization and data loading. Considering the computational cost of repeating all comparison models, three representative models with competitive performance or comparable complexity, namely YOLOv5n, YOLOv10n, and YOLOv13n, were selected for robustness analysis. The results are reported as mean ± standard deviation in Table 5.

The repeated experiments show that RFE-YOLO consistently achieves higher mAP@0.5 than the selected comparison models under all three random seeds. The mean mAP@0.5 of RFE-YOLO is 90.1% with a standard deviation of 0.3%, which is higher than those of YOLOv5n, YOLOv10n, and YOLOv13n. In terms of mAP@0.5:0.95, RFE-YOLO also achieves the highest mean value of 64.9% with a standard deviation of 0.2%. Although the number of repeated runs is limited and a more extensive statistical significance test was not conducted, the consistently higher mean performance and small standard deviation indicate that the improvement of RFE-YOLO is stable under different random initialization conditions. These results suggest that the reported gain is not caused by a single random run.
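The summary statistics can be recomputed from the per-seed mAP@0.5 values in Table 5. The paper does not state which standard deviation estimator was used; the sketch assumes the sample standard deviation (dividing by n − 1), which appears consistent with the reported values after rounding.

```python
from statistics import mean, stdev

# mAP@0.5 (%) for seeds 42, 5, and 15, copied from Table 5.
runs = {
    "YOLOv5n":  [87.8, 87.6, 88.2],
    "YOLOv10n": [87.1, 86.5, 87.3],
    "YOLOv13n": [87.7, 87.4, 88.2],
    "RFE-YOLO": [89.9, 90.1, 90.4],
}

summary = {name: (mean(v), stdev(v)) for name, v in runs.items()}
for name, (m, s) in summary.items():
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"{name}: {m:.1f} ± {s:.1f}")
```

With only three runs per model, the standard deviation is a rough indicator of spread rather than a basis for a significance test, which matches the caveat stated above.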

Figure 9 shows the mAP@0.5 curves of RFE-YOLO and the comparison models during training. The horizontal axis represents the training epoch, and the vertical axis represents mAP@0.5 on the validation set. As the number of training epochs increases, the mAP@0.5 values of most models gradually improve and then tend to stabilize. Compared with the other models, RFE-YOLO maintains a relatively high mAP@0.5 during most training stages and achieves the best final performance. This indicates that the proposed model has stable convergence behavior and reliable detection performance under the same training protocol.

4.4. Visualization Results

Figure 10 shows a visual comparison between YOLOv10n and the proposed RFE-YOLO for wind turbine blade defect detection. Figure 10a presents the detection results of YOLOv10n, and Figure 10b presents the detection results of RFE-YOLO on the same blade images. In the visualization results, the bounding boxes indicate the detected defect regions, and the category labels correspond to Crack, Oil leakage, and Peel. Compared with YOLOv10n, RFE-YOLO provides more complete bounding box localization and more accurate defect recognition for small and irregular defect regions. In particular, the baseline model still shows missed detections or incomplete localization for some subtle defect areas, whereas RFE-YOLO can better capture these defect regions. These visual results are consistent with the quantitative results in Table 4 and further support the effectiveness of the proposed model.

5. Conclusions

The experimental results demonstrate that the proposed RFE-YOLO improves wind turbine blade defect detection performance while maintaining a lightweight structure. Compared with YOLOv10n, RFE-YOLO improves mAP@0.5 from 87.1% to 89.9% and mAP@0.5:0.95 from 62.71% to 64.73%. Meanwhile, the number of parameters is reduced from 2.70M to 1.91M, GFLOPs decrease from 8.4 to 5.3, and the inference speed increases from 56.2 FPS to 88.8 FPS under the current experimental hardware configuration. The ablation results further show that C2f-RFAConv, CCFM, and ELA contribute to local feature extraction, multiscale feature fusion, and defect-related feature refinement, respectively. These results indicate that RFE-YOLO achieves a favorable balance among detection accuracy, model complexity, and inference efficiency for Crack, Oil leakage, and Peel defect detection.

However, several limitations remain. Although data augmentation was used, the dataset contains only 351 original images before augmentation, which may limit the generalization ability of the model under unseen inspection conditions. In addition, the experiments were conducted on a public dataset, and domain shift may occur when the model is applied to images captured by different unmanned aerial vehicle platforms, cameras, illumination conditions, and backgrounds. The reported FPS was measured on an NVIDIA GeForce RTX 3090 GPU rather than on an embedded onboard device. Therefore, the real end-to-end latency, memory consumption, and robustness under complex environmental conditions still require further evaluation.

In future work, larger-scale field datasets will be collected from real wind turbine blade inspection scenarios using unmanned aerial vehicles. Cross-dataset validation and external dataset testing will be conducted to evaluate the generalization ability of RFE-YOLO under different wind farms, imaging devices, and environmental conditions. In addition, edge-device deployment and end-to-end latency analysis will be performed to further assess the practical applicability of the proposed method in real industrial inspection systems.

Author Contributions

H.B. proposed the research problem and conducted experimental tests with graduate students; W.D. completed Excel data processing and Python programming; Y.W. performed the final manuscript review. All authors have read and agreed to the published version of the manuscript.

Funding

No financial funding was received for this study. The APC was funded by Tiangong University.

Data Availability Statement

No additional data was used.

Acknowledgments

During the preparation of this manuscript, the authors used Microsoft Word 2021 for manuscript editing and formatting, Microsoft Excel 2021 for experimental data organization and table preparation, and Python 3.8 for data processing, model implementation, training, and evaluation. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AP, Average Precision; AIFI, Attention-based Intra-scale Feature Interaction; BiFPN, Bidirectional Feature Pyramid Network; CBAM, Convolutional Block Attention Module; CCFM, Compact Cross-scale Feature Fusion Module; CCFF, Cross-scale Feature Fusion; CIB, Compact Inverted Block; CUDA, Compute Unified Device Architecture; ELA, Efficient Local Attention; EIOU, Extended Intersection over Union; Faster R-CNN, Faster Region-based Convolutional Neural Network; FPN, Feature Pyramid Network; FPS, frames per second; GFLOPs, giga floating-point operations; GPU, graphics processing unit; IoU, Intersection over Union; mAP, mean Average Precision; Mask R-CNN, Mask Region-based Convolutional Neural Network; PAN, Path Aggregation Network; PSA, Partial Self-Attention; RFAConv, Receptive Field Attention Convolution; RFE-YOLO, Receptive-Field-Enhanced You Only Look Once model; RT-DETR, Real-Time Detection Transformer; SCDown, Spatial-Channel Decoupled Downsampling; SGD, stochastic gradient descent; SPPF, Spatial Pyramid Pooling Fast; SSD, Single Shot MultiBox Detector; UAV, unmanned aerial vehicle; YOLO, You Only Look Once.

References

Yu, Y.; Cao, H.; Yan, X.; Wang, T.; Ge, S.S. Defect identification of wind turbine blades based on defect semantic features with transfer feature extractor. Neurocomputing 2020, 376, 1–9.
Desalegn, B.; Gebeyehu, D.; Tamrat, B. Wind energy conversion technologies and engineering approaches to enhancing wind power generation: A review. Heliyon 2022, 8, e11263.
Ding, S.; Yang, C.; Zhang, S. Acoustic-signal-based damage detection of wind turbine blades—A review. Sensors 2023, 23, 4987.
Clocker, K.; Hu, C.; Roadman, J.; Albertani, R.; Johnston, M.L. Autonomous sensor system for wind turbine blade collision detection. IEEE Sens. J. 2021, 22, 11382–11392.
Jiménez, A.A.; Márquez, F.P.G.; Moraleda, V.B.; Muñoz, C.Q.G. Linear and nonlinear features and machine learning for wind turbine blade ice detection and diagnosis. Renew. Energy 2019, 132, 1034–1048.
Chandrasekhar, K.; Stevanovic, N.; Cross, E.J.; Dervilis, N.; Worden, K. Damage detection in operational wind turbine blades using a new approach based on machine learning. Renew. Energy 2021, 168, 1249–1264.
Regan, T.; Beale, C.; Inalpolat, M. Wind turbine blade damage detection using supervised machine learning algorithms. J. Vib. Acoust. 2017, 139, 061010.
JOSH, U.V.A.; Sugumaran, V. Crack detection and localization on wind turbine blade using machine learning algorithms: A data mining approach. Struct. Durab. Health Monit. 2019, 13, 181.
Du, Y.; Zhou, S.; Jing, X.; Peng, Y.; Wu, H.; Kwok, N. Damage detection techniques for wind turbine blades: A review. Mech. Syst. Signal Process. 2020, 141, 106445.
Borja-Jaimes, V.; Adam-Medina, M.; López-Zapata, B.Y.; Vela Valdés, L.G.; Claudio Pachecano, L.; Sánchez Coronado, E.M. Sliding Mode Observer-Based Fault Detection and Isolation Approach for a Wind Turbine Benchmark. Processes 2021, 10, 54.
Borja-Jaimes, V.; Adam-Medina, M.; García-Morales, J.; Guerrero-Ramírez, G.V.; López-Zapata, B.Y.; Sánchez-Coronado, E.M. Actuator FDI Scheme for a Wind Turbine Benchmark Using Sliding Mode Observers. Processes 2023, 11, 1690.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
Shihavuddin, A.S.M.; Chen, X.; Fedorov, V.; Christensen, A.N.; Riis, N.A.B.; Branner, K.; Paulsen, R.R. Wind turbine surface damage detection by deep learning aided drone inspection analysis. Energies 2019, 12, 676.
Diaz, P.M.; Tittus, P. Fast detection of wind turbine blade damage using Cascade Mask R-DSCNN-aided drone inspection analysis. Signal Image Video Process. 2023, 17, 2333–2341.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
Zhao, H.; Gao, Y.; Deng, W. Defect detection using ShuffleNet-CA-SSD lightweight network for turbine blades in IoT. IEEE Internet Things J. 2024, 11, 32804–32812.
Lv, L.; Yao, Z.; Wang, E.; Ren, X.; Pang, R.; Wang, H.; Wu, H. Efficient and accurate damage detector for wind turbine blade images. IEEE Access 2022, 10, 123378–123386.
Cao, J.; Peng, B.; Gao, M.; Hao, H.; Guo, J.; Liu, X.; Liu, W. Detecting small damage on wind turbine surfaces using an improved YOLO in drone-captured scenes. J. Fail. Anal. Prev. 2025, 25, 725–740.
Yan, X.; Wu, G.; Zuo, Y. YOLOv4-based wind turbine blade crack defect detection. In Proceedings of In-coME-VI and TEPEN, Tianjin, China, 20–23 October 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 293–305.
Zhang, Y.; Yang, Y.; Sun, J.; Ji, R.; Zhang, P.; Shan, H. Surface defect detection of wind turbine based on lightweight YOLOv5s model. Measurement 2023, 220, 113222.
Fu, Z.; Zhang, F.; Ren, X.; Hao, B.; Zhang, X.; Yin, C.; Zhang, Y. LE-YOLO: Lightweight and efficient detection model for wind turbine blade defects based on improved YOLO. IEEE Access 2024, 12, 135985–135998.
Liu, L.; Li, P.; Wang, D.; Zhu, S. A wind turbine damage detection algorithm designed based on YOLOv8. Appl. Soft Comput. 2024, 154, 111364.
Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Gao, Y. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
Liao, L.; Song, C.; Wu, S.; Fu, J. A novel YOLOv10-based algorithm for accurate steel surface defect detection. Sensors 2025, 25, 769.
Cheng, S.; Wang, Z.; Liu, S.; Han, Y.; Sun, P.; Li, J. Attention-based lightweight YOLOv8 underwater target recognition algorithm. Sensors 2024, 24, 7640.
Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206.
Li, S.; He, C.; Li, R.; Zhang, L. A dual weighting label assignment scheme for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 9387–9396.
Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 1161–1177.
Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
Xu, W.; Wan, Y.; Zhao, W. ELA: Efficient location attention for deep convolution neural networks. J. Real-Time Image Process. 2025, 22, 140.

Figure 1. Structure of YOLOv10n.

Figure 2. RFAConv structure diagram.

Figure 3. Structure of the C2f-Receptive Field Attention Convolution (C2f-RFAConv) module: (a) original C2f module; (b) Receptive Field Attention Convolution (RFAConv) module; (c) proposed C2f-RFAConv module.

Figure 4. CCFF Structure Diagram.

Figure 5. CCFM Structure Diagram.

Figure 6. ELA Structure Diagram.

Figure 7. RFE-YOLO structure diagram.

Figure 8. (a) Confusion Matrix Normalized; (b) Precision-Recall Curve.

Figure 9. Performance comparison among different detection algorithms.

Figure 10. Visualization results: (a) Visualization results of YOLOv10n; (b) Visualization results of RFE-YOLO.

Table 1. Hyperparameter settings for network training.

Parameter	Configurations
Image Size	640 × 640
Random seed	42
Initial learning rate	0.01
Final learning rate factor	0.01
Batch Size	16
Momentum	0.937
Optimizer	SGD
Weight decay	0.0005
Training epochs	300
Early stopping	Not used
Model selection	Best validation mAP@0.5

Table 2. Hyperparameter settings for data augmentation.

Parameter	Configurations
Flip	0.1
Scale Transform	0.5
Vertical Flip	0.0
Horizontal Flip	0.5
Mosaic Enhancement	1.0

Table 3. Ablation study results of the proposed modules in RFE-YOLO.

C2f-RFAConv	CCFM	ELA	AP Crack (%)	AP Oil Leakage (%)	AP Peel (%)	mAP@0.5 (%)	Params (M)	GFLOPs	FPS
			83.3	95.8	82.1	87.1	2.7	8.4	56.2
✓			88.1	95.8	82.6	88.9	2.73	8.5	54.1
	✓		88.7	94.3	76.5	86.5	1.98	6.4	75.6
		✓	91.6	94.9	76.8	87.8	2.61	8	60.4
✓	✓		90	91.3	86.3	89.2	2.03	6.6	68.2
✓		✓	88.5	92.6	84.3	88.5	2.65	7.1	63.4
	✓	✓	87.8	94.2	82.6	88.2	1.96	5.9	79.8
✓	✓	✓	86.6	95.2	87.9	89.9	1.91	5.3	88.8
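
Since mAP@0.5 is the unweighted mean of the per-class AP values, the mAP@0.5 column of Table 3 can be cross-checked against the Crack, Oil Leakage, and Peel columns. The following is a minimal sketch of that check; the names `ABLATION_ROWS`, `mean_ap`, and `max_gap` are hypothetical, and the numbers are copied from the table. Small gaps are expected because the per-class values are themselves rounded to one decimal.

```python
# Per-class AP (Crack, Oil Leakage, Peel) and reported mAP@0.5, copied from Table 3.
ABLATION_ROWS = [
    ((83.3, 95.8, 82.1), 87.1),  # baseline YOLOv10n
    ((88.1, 95.8, 82.6), 88.9),  # + C2f-RFAConv
    ((88.7, 94.3, 76.5), 86.5),  # + CCFM
    ((91.6, 94.9, 76.8), 87.8),  # + ELA
    ((90.0, 91.3, 86.3), 89.2),  # + C2f-RFAConv + CCFM
    ((88.5, 92.6, 84.3), 88.5),  # + C2f-RFAConv + ELA
    ((87.8, 94.2, 82.6), 88.2),  # + CCFM + ELA
    ((86.6, 95.2, 87.9), 89.9),  # all three modules (RFE-YOLO)
]


def mean_ap(per_class_ap):
    """Unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


def max_gap(rows):
    """Largest absolute gap between recomputed and reported mAP@0.5."""
    return max(abs(mean_ap(ap) - reported) for ap, reported in rows)


if __name__ == "__main__":
    for ap, reported in ABLATION_ROWS:
        print(f"recomputed {mean_ap(ap):.2f}  reported {reported:.1f}")
    print(f"largest gap: {max_gap(ABLATION_ROWS):.2f}")
```

Every row agrees with the reported value to within 0.1 percentage points, which is consistent with rounding of the underlying per-class results.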

Table 4. Performance comparison of RFE-YOLO with representative detection methods.

Method	P (%)	R (%)	AP Crack (%)	AP Oil Leakage (%)	AP Peel (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs	FPS
SSD	20.9	72.1	90.4	32.7	64.1	62.4	44.93	35.64	34.86	13.1
Faster R-CNN	44.6	82.1	91	70	69	76.8	55.3	41.8	142.4	6.8
YOLOv5n	92.5	78.2	88.4	93.8	81.2	87.8	63.22	1.9	5.1	82.1
YOLOv7-tiny	72.4	62	44	90.8	67.1	67.3	48.46	6.02	13.02	19.8
YOLOv8n	88.4	85.1	85.3	93.4	76.6	85.1	61.27	3.01	8.2	76.8
YOLOv9-tiny	85.2	80.9	87.4	90.8	78.2	85	61.2	2.84	12.1	22.2
YOLOv10n	81.5	81.2	83.3	95.8	82.1	87.1	62.71	2.7	8.4	56.2
YOLOv11n	85.4	64.7	75.4	91.3	62.8	76.5	55.08	2.59	6.4	41.3
YOLOv12n	77.7	75.5	76.7	90.9	76	81.2	58.46	2.5	5.8	35.4
YOLOv13n	86.5	81.5	84.8	93.7	84.5	87.7	63.14	2.46	6.4	39.6
RFE-YOLO	83.9	85	86.6	95.2	87.9	89.9	64.73	1.91	5.3	88.8
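
The trade-off between RFE-YOLO and the YOLOv10n baseline can be read off Table 4 as a gain in percentage points for accuracy and as relative changes for cost and speed. The sketch below works through that arithmetic using only the tabulated values; the dictionary keys and the helper names `relative_change` and `summarize` are hypothetical.

```python
# Values copied from Table 4 (YOLOv10n parameters written as 2.70 M, as in the abstract).
BASELINE = {"map50": 87.1, "params_m": 2.70, "gflops": 8.4, "fps": 56.2}  # YOLOv10n
PROPOSED = {"map50": 89.9, "params_m": 1.91, "gflops": 5.3, "fps": 88.8}  # RFE-YOLO


def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


def summarize(baseline: dict, proposed: dict) -> dict:
    """Accuracy gain in percentage points, other metrics as relative change in percent."""
    return {
        "map50_gain_points": round(proposed["map50"] - baseline["map50"], 1),
        "params_change_pct": round(relative_change(proposed["params_m"], baseline["params_m"]), 1),
        "gflops_change_pct": round(relative_change(proposed["gflops"], baseline["gflops"]), 1),
        "fps_change_pct": round(relative_change(proposed["fps"], baseline["fps"]), 1),
    }


if __name__ == "__main__":
    for name, value in summarize(BASELINE, PROPOSED).items():
        print(f"{name}: {value:+.1f}")
```

With the tabulated numbers this gives a 2.8 point gain in mAP@0.5, roughly 29% fewer parameters, roughly 37% fewer GFLOPs, and roughly 58% higher FPS, all under the reported experimental setting.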

Table 5. Robustness analysis under different random seeds.

Method	Seed 42 mAP@0.5 (%)	Seed 5 mAP@0.5 (%)	Seed 15 mAP@0.5 (%)	Mean ± Std mAP@0.5 (%)	Mean ± Std mAP@0.5:0.95 (%)
YOLOv5n	87.8	87.6	88.2	87.9 ± 0.3	63.3 ± 0.2
YOLOv10n	87.1	86.5	87.3	87.0 ± 0.4	62.6 ± 0.3
YOLOv13n	87.7	87.4	88.2	87.8 ± 0.4	63.2 ± 0.3
RFE-YOLO	89.9	90.1	90.4	90.1 ± 0.3	64.9 ± 0.2
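
The Mean ± Std mAP@0.5 column of Table 5 can be recomputed from the three per-seed values. The sketch below does so with Python's standard library, assuming the sample standard deviation (n − 1 denominator); the table does not state which estimator was used, and the names `SEED_MAP50` and `mean_std` are hypothetical.

```python
from statistics import mean, stdev

# Per-seed mAP@0.5 (%) for seeds 42, 5, and 15, copied from Table 5.
SEED_MAP50 = {
    "YOLOv5n": (87.8, 87.6, 88.2),
    "YOLOv10n": (87.1, 86.5, 87.3),
    "YOLOv13n": (87.7, 87.4, 88.2),
    "RFE-YOLO": (89.9, 90.1, 90.4),
}


def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator), rounded to one decimal."""
    return round(mean(values), 1), round(stdev(values), 1)


if __name__ == "__main__":
    for method, values in SEED_MAP50.items():
        m, s = mean_std(values)
        print(f"{method}: {m:.1f} ± {s:.1f}")
```

Under this assumption the recomputed values match the tabulated ones for all four methods. With only three seeds per method, the spread is a rough indication of run-to-run variation rather than a precise estimate.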

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
