2.3.1. Overview of the Network Architecture
As a single-stage object detection algorithm, YOLOv5 [20] enables high-speed, high-accuracy real-time detection. The core idea of YOLOv5 is to take an entire image as input to the neural network and divide it into multiple grid cells. Each grid cell does not predict merely one object; instead, it predicts multiple potential targets through predefined anchor boxes. Specifically, each grid cell generates predictions for each anchor box, including bounding box coordinates and class probabilities. Consequently, a single grid cell can detect multiple objects, depending on the number of anchor boxes. This design formulates object detection as a regression problem, balancing detection diversity and computational efficiency.
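For illustration only, the snippet below shows how the number of candidate predictions produced by a YOLO-style head follows from the grid resolution, the number of anchor boxes per cell, and the number of classes; the input size, strides, anchor count, and class examples are hypothetical values, not the configuration used in this work.

```python
# Illustrative only: output size of a YOLO-style detection head.
num_classes = 3          # hypothetical damage categories (e.g., crack, erosion, contamination)
num_anchors = 3          # anchor boxes predicted per grid cell
img_size = 640           # square input resolution
strides = [8, 16, 32]    # downsampling factors of the three detection scales

for stride in strides:
    grid = img_size // stride                 # grid cells per side
    boxes = grid * grid * num_anchors         # candidate boxes at this scale
    per_box = 5 + num_classes                 # (x, y, w, h, objectness) + class scores
    print(f"stride {stride:2d}: {grid}x{grid} grid, {boxes} boxes, "
          f"tensor shape ({num_anchors}, {grid}, {grid}, {per_box})")
```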
The surface of wind turbine blades often exhibits a diverse array of damage patterns, varying in shape, size, and textural properties, which can complicate the detection process. Wind farms are typically situated in remote, rugged mountainous regions or open areas characterized by complex backgrounds and variable natural lighting conditions, factors that can readily interfere with image-based detection methods. These challenges necessitate detection models with enhanced accuracy and robustness to meet the stringent requirements of wind turbine blade maintenance. YOLOv5s [21] is a computationally efficient variant of the YOLOv5 object detection framework. It achieves real-time performance using depthwise separable convolutions and a cross-stage feature fusion strategy. Its multi-scale feature extraction architecture integrates Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) modules, coupled with an adaptive anchor box generation mechanism. This allows the model to accurately detect defect targets with significant aspect ratio differences, as often encountered in wind turbine blade damage assessment. The modular design of YOLOv5s further enables flexible scaling of the network structure and incorporation of attention mechanisms. Hence, YOLOv5s has been chosen as the foundational framework for developing a novel model for detecting damage in wind turbine blades.
The network structure of the improved wind turbine blade damage detection model (MSWindD-YOLO) is illustrated in Figure 2. To address computational redundancy in the original Focus module within the backbone network, it was replaced by a Stem module. This module offers superior computational efficiency and multi-scale fusion capability, enabling more effective initial feature extraction from input images; consequently, richer and more accurate information is provided to subsequent processing stages. Furthermore, the EfficientNetV2 architecture was incorporated, leveraging depthwise separable convolutions and attention mechanisms to maintain feature extraction capability while reducing model complexity. The original SPPF (Spatial Pyramid Pooling-Fast) module for spatial pyramid pooling was retained, with its multi-scale representation enhanced through an adaptive receptive field adjustment mechanism. In the Neck network, the original C3 feature extraction module was replaced with a novel GBC3-FEA module, which combines a dynamic grouped convolution strategy with channel reparameterization techniques, enabling a lightweight Neck design while preserving its feature expression capability. At the end of the feature pyramid, a hybrid attention mechanism (CBAM) was introduced; by strengthening the response to key features through joint channel-spatial attention modeling, it effectively alleviates missed detections of targets under complex backgrounds. Finally, the original CIoU loss function was replaced with the Shape-IoU loss function, which leverages geometric prior information of the target bounding box to establish a dynamic weighting adjustment mechanism, thereby improving training speed and localization accuracy.
The process flow for utilizing the enhanced wind turbine blade damage detection model (MSWindD-YOLO) involves several steps. Initially, a high-definition imaging module installed on a UAV platform captures real-time high-definition images of the blade surface. Subsequently, an onboard edge computing unit performs noise reduction, resolution standardization, and normalization to produce standardized tensor data that conform to the input requirements of the deep learning model. The processed data are then sent to the inference system via a low-latency communication protocol. Within the inference system, the feature extraction network's Stem module carries out the initial extraction of shallow feature representations. Deep semantic features are then progressively extracted through the layers of the EfficientNetV2 architecture. Finally, the SPPF module integrates multi-scale contextual information.
During the feature fusion stage, the GBC3-FEA module combines multi-level detail information and semantic features using lightweight convolutions, while the CBAM attention mechanism selectively focuses on essential channels and spatial regions to suppress background interference. The detection head then predicts the bounding-box coordinates, categories, and confidence scores of the damage from the multi-scale feature maps, employing the Shape-IoU loss function during training to enhance localization precision. Redundant boxes are subsequently eliminated through non-maximum suppression, and the resulting detections are structured, overlaid on the original image to create a visual report, and transmitted to the monitoring terminal in real time.
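As a minimal sketch of the kind of onboard preprocessing described above (resolution standardization and normalization into a model-ready tensor), the following function letterboxes an image onto a square canvas and scales pixel values to [0, 1]. The 640 × 640 target size, the gray padding value, and the plain 0-1 normalization are assumptions for illustration, not the exact pipeline used in this work.

```python
import numpy as np

def preprocess(image: np.ndarray, target_size: int = 640) -> np.ndarray:
    """Resize (letterbox) an HWC image and normalize it into a CHW float tensor."""
    h, w = image.shape[:2]
    scale = target_size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resize with plain NumPy (an edge device would typically
    # use an optimized resize from OpenCV or similar).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    # Pad to a square canvas so the aspect ratio of the blade image is preserved.
    canvas = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    canvas[:new_h, :new_w] = resized
    # HWC uint8 -> CHW float32 in [0, 1], a typical input contract for YOLO-style models.
    return canvas.transpose(2, 0, 1).astype(np.float32) / 255.0
```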
2.3.2. Lightweight Optimization of Backbone Network
To make the model more lightweight and efficient, the EfficientNetV2 [22] convolutional neural network architecture was strategically integrated into the backbone network. This minimizes the consumption of limited computing resources during wind turbine blade surface damage detection while ensuring real-time data transmission.
EfficientNetV2 is a convolutional neural network architecture that strategically adapts network depth, width, and resolution using a systematic scaling approach, demonstrating outstanding performance within resource-constrained environments. As illustrated in Figure 3, its core structure is primarily composed of MBConv and Fused-MBConv modules.
The MBConv module [23], serving as the core component of the EfficientNetV2 architecture, comprises a series of operations including a 1 × 1 pointwise convolution (expansion stage), a 3 × 3 depthwise convolution, a Squeeze-Excitation (SE) module, a 1 × 1 pointwise convolution (projection stage), and a skip connection. Initially, the 1 × 1 pointwise convolution in the expansion stage increases the number of input channels to expand feature dimensionality, enabling subsequent convolutions to capture richer contextual information. This is followed by a 3 × 3 depthwise convolution that extracts spatial features while operating independently on each input channel, significantly reducing model parameters and computational complexity. Subsequently, the Squeeze-Excitation module dynamically recalibrates channel-wise feature responses through an attention mechanism, emphasizing informative channels while suppressing less relevant ones. The following 1 × 1 pointwise convolution in the projection stage then reduces the channel count back to the original dimension, consolidating and compressing the extracted spatial features. Finally, a skip connection is employed to add the input tensor to the output of the projection stage when spatial dimensions and channel numbers match, thereby mitigating gradient vanishing issues, preserving feature diversity, and facilitating efficient gradient flow during training. This modular design achieves a balance between model capacity and computational efficiency while maintaining representational power.
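The PyTorch sketch below mirrors the MBConv layout just described (1 × 1 expansion, 3 × 3 depthwise convolution, SE recalibration, 1 × 1 projection, optional skip connection). The expansion ratio, SE reduction factor, and activation choices are illustrative defaults, not necessarily those of EfficientNetV2.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-Excitation: global pooling followed by two 1x1 convs and a sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)          # channel-wise recalibration

class MBConv(nn.Module):
    """Sketch of MBConv: 1x1 expand -> 3x3 depthwise -> SE -> 1x1 project (+ skip)."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),      # depthwise
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # linear projection
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```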
The Fused-MBConv module [24] integrates design principles from standard convolution and the MBConv architecture to enhance feature extraction efficiency. In shallow network layers with limited input channels, this module replaces the conventional depthwise separable convolution in traditional MBConv with a single standard convolutional layer. This substitution enables simultaneous spatial feature extraction and cross-channel information fusion within a unified operation. The workflow proceeds as follows: initial feature extraction is performed through a standard convolutional layer, followed by Batch Normalization to accelerate training convergence and mitigate internal covariate shift. Nonlinear activation is introduced via SiLU or ReLU functions to enhance model expressiveness. By leveraging the dense connectivity characteristic of standard convolutions, this architecture maintains lightweight computation while improving the representational capacity of shallow-layer features, which is particularly beneficial in scenarios with constrained input channels. Compared to the fragmented computation of separate depthwise and pointwise convolutions in conventional MBConv, the fused structure reduces computational fragmentation and optimizes hardware utilization. This design achieves a superior accuracy-efficiency trade-off on resource-constrained platforms such as mobile devices, demonstrating the efficacy of architectural fusion in balancing performance and computational efficiency.
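For comparison, a companion sketch of Fused-MBConv, in which a single standard 3 × 3 convolution replaces the 1 × 1 expansion plus 3 × 3 depthwise pair; hyperparameters are again illustrative.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Sketch of Fused-MBConv: one standard 3x3 conv performs spatial feature
    extraction and cross-channel fusion at once, followed by a 1x1 projection."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),   # fused expand + spatial conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False),             # linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```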
2.3.3. GBC3-FEA Feature Extraction Module
Deep convolutional neural networks typically incorporate numerous convolutional layers, inevitably incurring substantial computational costs. To address this, architectures such as MobileNet [25] and ShuffleNet [26] employ strategies like depthwise separable convolutions and channel shuffling, frequently combined with smaller filters for efficiency. However, a significant computational overhead persists in the 1 × 1 pointwise convolutions used for channel fusion, which remain a non-negligible source of memory consumption and FLOPs.
Regarding the ubiquity of redundancy in the intermediate feature maps computed by mainstream convolutional neural networks, existing literature [27] has proposed mitigating this issue by reducing the computational resources required to generate these feature maps, specifically by optimizing the number of convolutional filters employed in their generation. Formally, given input data $X \in \mathbb{R}^{c \times h \times w}$ (where $c$, $h$, and $w$ denote the number of channels, the height, and the width, respectively), the transformation performed by a convolutional layer to produce $n$ feature maps can be formulated as Equation (1):

$$Y = X * f + b \tag{1}$$

In Equation (1), $*$ denotes the convolution operation, $b \in \mathbb{R}^{n}$ is the bias vector, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ represents the convolutional filters. Here, $h'$ and $w'$ denote the height and width of the output data, respectively, and $k$ is the spatial size of the kernel. Given that both the number of filters $n$ and the number of input channels $c$ are typically large, the resulting FLOPs can easily reach the order of millions or even billions, posing a significant computational burden.
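To make the magnitude concrete, the short calculation below evaluates $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$ for one hypothetical layer configuration.

```python
# Multiply-accumulate count of a single standard convolution, n * h' * w' * c * k * k.
# The layer configuration is a hypothetical example.
c, h_out, w_out, n, k = 256, 80, 80, 256, 3
macs = n * h_out * w_out * c * k * k
print(f"{macs:,} multiply-accumulates")   # 3,774,873,600 -- billions for one layer
```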
According to Equation (1), the number of parameters to be optimized (in the convolutional filters $f$ and the bias vector $b$) explicitly depends on the dimensions of the input and output feature maps. The feature maps generated by convolutional layers often exhibit significant redundancy, with some feature maps being highly similar to one another. We argue that generating each of these redundant feature maps individually is not indispensable, since it consumes substantial FLOPs and parameters. Instead, the output feature maps can be obtained as "ghost" features through low-cost transformations applied to a small number of intrinsic feature maps; these intrinsic feature maps typically have smaller dimensions and are produced by an ordinary convolutional layer. Specifically, $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ are generated through a primary convolution operation:

$$Y' = X * f' \tag{2}$$
In Equation (2), $f' \in \mathbb{R}^{c \times k \times k \times m}$ represents the convolutional filters used (with $m \le n$). For brevity, the bias terms are omitted. Hyperparameters such as filter size, stride, and padding remain identical to those of the standard convolution in Equation (1), ensuring that the spatial dimensions $h'$ and $w'$ of the output feature maps are consistent. Building upon this foundation, to ultimately obtain the required $n$ feature maps, a series of low-complexity linear operations is applied to each intrinsic feature map in $Y'$ to generate $s$ ghost features. This process can be mathematically expressed as Equation (3):

$$y_{ij} = \Phi_{i,j}\left(y'_i\right), \quad i = 1, \ldots, m, \; j = 1, \ldots, s \tag{3}$$
In Equation (3), $y'_i \in Y'$ denotes the $i$-th intrinsic feature map. The symbol $\Phi_{i,j}$ represents the $j$-th linear transformation applied to $y'_i$ for the generation of the $j$-th ghost feature map $y_{ij}$, where $j \in [1, s-1]$. The last operation, $\Phi_{i,s}$, is an identity mapping used to preserve the original intrinsic feature map, as illustrated in Figure 4. Thus, Equation (3) yields $n = m \cdot s$ output feature maps in total, denoted as $Y = [y_{11}, y_{12}, \ldots, y_{ms}]$. Critically, these linear transformations $\Phi$ (e.g., 3 × 3 depthwise convolutions) are channel-wise operations, which incur significantly lower computational complexity than standard convolutions.
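A PyTorch sketch of a Ghost module consistent with Equations (2) and (3) is given below: a primary convolution produces the intrinsic maps, a cheap d × d depthwise convolution generates the ghost maps, and concatenation plays the role of the identity mapping. The default kernel sizes and ratio s = 2 are illustrative, and the sketch assumes the output channel count is divisible by s.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: m = out_ch // s intrinsic maps from a primary
    convolution (Equation (2)) plus cheap depthwise "ghost" maps (Equation (3))."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 1, s: int = 2,
                 d: int = 3, relu: bool = True):
        super().__init__()
        m = out_ch // s                      # intrinsic channels (assumes out_ch % s == 0)
        ghost_ch = out_ch - m                # channels produced by the cheap operations
        act = nn.ReLU(inplace=True) if relu else nn.Identity()
        self.primary = nn.Sequential(        # primary k x k convolution
            nn.Conv2d(in_ch, m, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(m), act,
        )
        self.cheap = nn.Sequential(          # Phi: d x d depthwise transformation
            nn.Conv2d(m, ghost_ch, d, 1, d // 2, groups=m, bias=False),
            nn.BatchNorm2d(ghost_ch), act,
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)        # Y' in Equation (2)
        y_ghost = self.cheap(y_intrinsic)    # ghost features from Equation (3)
        # Concatenation keeps the intrinsic maps (identity mapping) alongside their ghosts.
        return torch.cat([y_intrinsic, y_ghost], dim=1)
```

A later sketch (GhostBottleneck) reuses this class; the `relu` flag allows the second Ghost module of a bottleneck to act as a linear projection.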
Given an input feature map $X \in \mathbb{R}^{c \times h \times w}$, the Ghost module first generates $m$ intrinsic feature maps $Y'$ using a primary $k \times k$ convolution. A $d \times d$ depthwise convolution is then applied to each feature map in $Y'$ as a linear transformation to produce $s$ ghost feature maps per intrinsic map. The final output $Y \in \mathbb{R}^{h' \times w' \times n}$ is formed by concatenating these $n$ feature maps. The computational costs, measured in FLOPs, of generating $n$ output feature maps via conventional convolution (denoted as $B$) and via the proposed Ghost convolution (denoted as $C$) are compared in Equation (4):

$$\frac{B}{C} = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx \frac{s \cdot c}{c + s - 1} \approx s \tag{4}$$

Since the cheap $d \times d$ kernels are of similar magnitude to the $k \times k$ kernels and $s \ll c$, it is evident from Equation (4) that standard convolution requires roughly $s$ times more computation than Ghost convolution.
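The ratio in Equation (4) can be checked numerically; the snippet below compares the two FLOP counts for one hypothetical layer with s = 2.

```python
# Compare the FLOPs of standard convolution (B) and Ghost convolution (C), Equation (4).
# The layer configuration is hypothetical; d is the kernel size of the cheap depthwise step.
c, h_out, w_out, n, k, d, s = 256, 80, 80, 256, 3, 3, 2

B = n * h_out * w_out * c * k * k                                  # standard convolution
C = (n // s) * h_out * w_out * c * k * k \
    + (s - 1) * (n // s) * h_out * w_out * d * d                   # primary + cheap operations
print(f"B / C = {B / C:.2f}  (approximately s = {s})")
```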
To enable the deployment of CNNs on resource-constrained devices by lowering their computational demands, we integrated the Ghost module (Figure 5a) into a standard bottleneck structure, forming a Ghost bottleneck module. This bottleneck design has two variants, distinguished by their stride values (stride = 1 and stride = 2). As shown in Figure 5b, the first variant (stride = 1) stacks two Ghost modules with a residual connection that bypasses the input to the output. Batch normalization (BN) and ReLU activation follow the first Ghost module, while the second employs a linear projection (followed by BN) without ReLU. The second variant (stride = 2, Figure 5c) accommodates downsampling: a depthwise convolution (DWConv) layer with stride 2 is inserted between the two Ghost modules to reduce the spatial resolution while propagating features across layers. To maximize efficiency, the primary convolution within each Ghost module utilizes pointwise convolution for channel expansion.
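The following sketch assembles the two Ghost bottleneck variants described above, reusing the GhostModule sketch given earlier; the channel widths and the shortcut design for the stride-2 case are illustrative choices.

```python
import torch
import torch.nn as nn

class GhostBottleneck(nn.Module):
    """Sketch of a Ghost bottleneck: two stacked Ghost modules with a shortcut.
    The second Ghost module acts as a linear projection (relu=False); a stride-2
    depthwise convolution between the two modules handles downsampling."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        assert stride in (1, 2)
        layers = [GhostModule(in_ch, mid_ch)]                       # expansion Ghost module
        if stride == 2:                                             # downsampling variant
            layers += [nn.Conv2d(mid_ch, mid_ch, 3, 2, 1, groups=mid_ch, bias=False),
                       nn.BatchNorm2d(mid_ch)]
        layers += [GhostModule(mid_ch, out_ch, relu=False)]         # linear projection + BN
        self.main = nn.Sequential(*layers)
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()                           # residual connection
        else:
            self.shortcut = nn.Sequential(                          # lightweight matching path
                nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```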
To further reduce the model's computational footprint and enhance its damage feature extraction efficiency, we developed a lightweight feature extraction architecture, termed the GBC3-FEA module (Figure 5d), by integrating Ghost bottlenecks into the YOLOv5s C3 module. The C3 module's multi-branch design with dense connections serves as a suitable foundation for this integration, enabling a significant reduction in computational complexity while preserving representational capacity. In our GBC3-FEA design, an initial standard convolutional layer first halves the input channel depth. The features are then processed through a sequential cascade of Ghost bottleneck layers and a residual branch. This dual-pathway architecture captures rich semantic information, which is subsequently merged via a concatenation operation. Finally, a convolutional layer refines the fused features to enhance contextual coherence.
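The rough PyTorch sketch below reflects one possible reading of the GBC3-FEA layout described above: a convolution that halves the channel width, a cascade of Ghost bottlenecks in one branch, a convolutional shortcut branch, concatenation, and a refining convolution. It reuses the GhostBottleneck sketch above, and the exact layer arrangement, channel widths, and activation choices are assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class GBC3FEA(nn.Module):
    """Rough sketch of a GBC3-FEA-style block: reduce channels, run a Ghost
    bottleneck cascade alongside a shortcut branch, concatenate, then refine."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 1):
        super().__init__()
        mid = out_ch // 2
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.SiLU())
        self.shortcut = nn.Sequential(nn.Conv2d(in_ch, mid, 1, bias=False),
                                      nn.BatchNorm2d(mid), nn.SiLU())
        self.ghosts = nn.Sequential(*[GhostBottleneck(mid, mid * 2, mid) for _ in range(n)])
        self.refine = nn.Sequential(nn.Conv2d(2 * mid, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        main = self.ghosts(self.reduce(x))                        # lightweight Ghost branch
        return self.refine(torch.cat([main, self.shortcut(x)], dim=1))
```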
The novelty of the proposed GBC3-FEA module lies in its integration of Ghost convolution and the C3 module from YOLOv5 into a unified lightweight feature extraction architecture, specifically designed for wind turbine blade damage detection. Unlike standard lightweight approaches such as MobileNet or ShuffleNet, which focus mainly on general-purpose efficiency, our module explicitly reduces computational cost while preserving multi-scale feature fusion capabilities through the C3 structure. Moreover, by embedding Ghost bottlenecks, which use low-cost linear operations to generate "ghost" features, within the C3 module's multi-branch layout, we achieve a better trade-off between accuracy and computational efficiency compared to conventional lightweight convolutional designs. This innovation is particularly valuable in the context of vision-based structural health monitoring, where real-time processing under hardware constraints is critical.
2.3.4. Attention Mechanism Introduction
In developing a damage detection model for wind turbine blades, a critical challenge is the precise identification of defects within complex and dynamically changing natural environments. To address this, we have conducted an in-depth study on methods to enhance the focus on blade damage characteristics, with the goal of comprehensively extracting latent damage features from inspected blades. Our approach integrates the Convolutional Block Attention Module (CBAM) [28] into the MSWindD-YOLO framework. By combining channel and spatial attention, CBAM dynamically recalibrates feature map weights across different dimensions, enabling the model to adaptively concentrate on regions critical for damage identification. This mechanism significantly enhances the sensitivity towards blade damage features while simultaneously improving robustness and accuracy against complex backgrounds. Consequently, our work contributes to the advancement of efficient and intelligent detection systems for wind turbine blade damage.
CBAM is an attention mechanism consisting of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), designed to enhance the performance of convolutional neural networks in tasks such as image recognition. As illustrated in Figure 6, the input features are processed first by the channel attention mechanism and subsequently by the spatial attention mechanism, which emphasizes informative spatial locations. This design enables the network to concentrate on salient features across both the channel and spatial dimensions.
The Channel Attention Module (CAM) is responsible for assigning varying weights to features across the channel dimension, with its architecture illustrated in Figure 7. Within the CAM, global average pooling (GAP) and global maximum pooling (GMP) are applied in parallel to each channel of the input feature map. These two pooling operations extract the global average and global maximum information of the feature map, respectively. Subsequently, the results of these pooling operations are fed into a shared multilayer perceptron (MLP) with a hidden layer for further processing. Following summation, the MLP outputs are fed into a sigmoid activation function, yielding the final channel attention weights. These weights are utilized to adjust the channel-wise weights of the input feature map, thereby enhancing the model's focus on critical channels.
Specifically, the output $M_c(F)$ of the Channel Attention Module (CAM) can be mathematically formulated as Equation (5):

$$M_c(F) = \sigma\big(\mathrm{MLP}(F^c_{avg}) + \mathrm{MLP}(F^c_{max})\big) = \sigma\big(W_1(W_2(F^c_{avg})) + W_1(W_2(F^c_{max}))\big) \tag{5}$$

where $\sigma$ denotes the sigmoid activation function, $\mathrm{MLP}$ represents a shared multi-layer perceptron with a hidden layer, $W_1$ and $W_2$ correspond to the output-layer weights and hidden-layer weights within the MLP architecture, and $F^c_{avg}$ and $F^c_{max}$ respectively indicate the channel-wise global average pooling features and global max pooling features.
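The channel attention computation of Equation (5) can be sketched in PyTorch as follows; the reduction ratio of the shared MLP is an assumed value.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (Equation (5)): a shared MLP processes the GAP and GMP
    descriptors, the results are summed, and a sigmoid yields per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                                   # shared MLP (W2 then W1)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))    # F_avg^c branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))     # F_max^c branch
        return x * self.sigmoid(avg + mx)                          # reweighted feature map
```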
Although the Channel Attention Module (CAM) can effectively highlight the importance of different channels, it may overlook spatial positional information within the feature maps. To address this limitation, CBAM introduces a Spatial Attention Module (SAM) to complement its functionality. The architecture of the Spatial Attention Module is illustrated in Figure 8.
The Spatial Attention Module (SAM) generates a spatial attention mask by leveraging the inter-spatial relationships of features. The module first applies both global average pooling (GAP) and global maximum pooling (GMP) along the channel dimension of the input feature map. This operation produces two 2D feature maps, each encoding a different type of global spatial context (average-pooled and max-pooled). These two maps are then concatenated along the channel axis to form a composite feature descriptor. This descriptor is processed by a standard convolutional layer with a 7 × 7 kernel and passed through a sigmoid function to create the spatial weight map. The final weight map is applied to the input feature map via element-wise multiplication, recalibrating it to emphasize semantically informative regions. The output $M_s(F)$ of the Spatial Attention Module can be formally expressed by Equation (6):

$$M_s(F) = \sigma\big(f^{7 \times 7}\big([F^s_{avg};\, F^s_{max}]\big)\big) \tag{6}$$

where $f^{7 \times 7}$ denotes a convolution operation with a 7 × 7 kernel, and $F^s_{avg}$ and $F^s_{max}$ represent the feature maps obtained by channel-wise global average pooling and global maximum pooling, respectively.
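A companion sketch of the spatial attention step in Equation (6) and the sequential CBAM composition of Figure 6 is given below; it reuses the ChannelAttention sketch above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention (Equation (6)): channel-wise average and max maps are
    concatenated, convolved with a 7x7 kernel, and squashed by a sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)      # F_avg^s: channel-wise average map
        mx, _ = torch.max(x, dim=1, keepdim=True)     # F_max^s: channel-wise max map
        return x * self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)          # reuses the sketch above
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```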
2.3.5. Loss Function Redefinition
In target detection tasks, bounding box regression serves as a critical component within the detector's localization branch, playing a pivotal role in precise object localization. While the CIoU loss function [29] adopted by traditional YOLOv5 models improves bounding box regression accuracy to some extent, its over-reliance on the aggregation of regression metrics overlooks the intrinsic attributes of bounding boxes, such as shape and scale characteristics. This limitation results in slower convergence rates and suboptimal detection efficiency during model training.
To address this challenge and improve both training efficiency and detection accuracy, we introduced the Shape-IoU loss function [30] into the MSWindD-YOLO framework, replacing the conventional CIoU loss function. The Shape-IoU loss function redefines the penalty terms of the regression objective by emphasizing the inherent geometric properties of bounding boxes, particularly their shape and scale. This reparameterization enables more precise control during the bounding box regression stage. Specifically, the loss function comprises three key components: the Shape Distance cost function, the Angle cost function, and the IoU cost function.
The Shape Distance cost function primarily accounts for discrepancies between predicted and ground-truth bounding boxes along the horizontal and vertical dimensions. By integrating the deviation in center coordinates with shape-dependent weighting factors derived from the ground-truth box geometry, this component accurately quantifies the extent of shape discrepancy between predicted and actual bounding boxes, as formulated in Equation (7). Here, $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ denote the center coordinates of the predicted and ground-truth bounding boxes, respectively, and $ww$ and $hh$ are the shape-dependent weighting coefficients derived from the ground-truth box's dimensions.
The Angle cost function is introduced as an extension of the Shape Distance cost function and quantifies rotational discrepancies between predicted and ground-truth bounding boxes. The function, mathematically formulated in Equation (10), further enhances the accuracy of bounding box regression.
The IoU term in the Shape-IoU loss retains the standard formulation, as defined in Equation (13). It directly minimizes the discrepancy between the predicted and ground-truth bounding boxes by optimizing their Intersection over Union (IoU), thereby enhancing regression accuracy.
The aforementioned three functions collectively constitute the integrated operational framework of the Shape-IoU loss function, as illustrated in Figure 9. The definition of the Shape-IoU loss function is given explicitly in Equation (14).
As indicated in Equation (14), the Shape-IoU loss function provides comprehensive optimization of the bounding box regression process by integrating multiple geometric factors, including shape, scale, and IoU. When applied to the MSWindD-YOLO model, this loss function significantly enhances both the training convergence speed and the final inference accuracy.
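For illustration, the sketch below implements the IoU term and the shape-weighted center-distance term in the spirit of the Shape-IoU formulation cited above; the Angle cost of Equation (10) and the exact composition of Equation (14) are not reproduced, and the `scale` exponent controlling the shape weights is an assumed hyperparameter.

```python
import torch

def shape_weighted_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                            scale: float = 1.0, eps: float = 1e-7) -> torch.Tensor:
    """Illustrative IoU + shape-weighted distance terms for (x1, y1, x2, y2) boxes, shape (N, 4)."""
    # Intersection over Union (standard formulation).
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Shape-dependent weights ww, hh derived from the ground-truth box dimensions.
    w_gt = target[:, 2] - target[:, 0]
    h_gt = target[:, 3] - target[:, 1]
    denom = w_gt.pow(scale) + h_gt.pow(scale) + eps
    ww = 2 * w_gt.pow(scale) / denom
    hh = 2 * h_gt.pow(scale) / denom

    # Center deviation normalized by the diagonal of the smallest enclosing box.
    cx_p = (pred[:, 0] + pred[:, 2]) / 2
    cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2
    cy_t = (target[:, 1] + target[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    diag_sq = cw.pow(2) + ch.pow(2) + eps
    dist_shape = hh * (cx_p - cx_t).pow(2) / diag_sq + ww * (cy_p - cy_t).pow(2) / diag_sq

    return 1.0 - iou + dist_shape        # per-box loss values
```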