1. Introduction
In recent years, driven by the continued advancement of the carbon peaking and carbon neutrality goals as well as the rapid growth of the renewable energy industry, the installed capacity of photovoltaic (PV) power generation has increased substantially. Meanwhile, PV power stations are evolving toward larger-scale, more centralized, and increasingly intelligent operation [1,2,3]. As the core components of PV power generation systems, PV modules are inevitably subjected to manufacturing imperfections, environmental degradation, thermal cycling, partial shading, and electrical mismatch during long-term operation, which may give rise to thermal defects such as local overheating, hot spots, and abnormal temperature rise [4]. These defects not only degrade the power generation efficiency of PV modules and accelerate their performance deterioration, but may also trigger encapsulation aging, local burn damage, and even severe fire hazards [5]. Therefore, the development of rapid, accurate, and non-contact approaches for detecting thermal defects in operating PV modules has become a critical issue in the intelligent operation and maintenance of PV power stations [6].
Conventional inspection of PV power stations mainly relies on manual examination assisted by handheld thermal imagers. However, such an approach typically suffers from low inspection efficiency, high labor intensity, strong dependence on operator experience, and limited coverage, making it difficult to satisfy the practical requirements of high-frequency and high-precision inspection in large-scale PV power stations [7]. In contrast, unmanned aerial vehicle (UAV) platforms offer significant advantages, including high mobility, wide inspection coverage, flexible deployment, and superior operational efficiency. When integrated with infrared thermography, UAVs enable long-range and non-contact inspection of large-scale PV arrays under non-shutdown conditions, allowing rapid acquisition of surface temperature field information from PV modules and thereby facilitating effective identification of hot spots, locally abnormal heating regions, and potentially faulty modules. Therefore, thermal defect detection methods for PV modules in UAV-based infrared inspection scenarios have emerged as an important research direction in the intelligent operation and maintenance of PV power stations [8,9,10].
In recent years, deep learning-based object detection algorithms have been increasingly applied to photovoltaic thermal defect detection. Among them, You Only Look Once (YOLO)-based methods have attracted increasing attention owing to their favorable balance between detection accuracy and inference efficiency [11,12,13]. For example, Hong et al. introduced an infrared image enhancement strategy and an improved lightweight YOLOv8n model to improve the recognition accuracy of solar panel defects in infrared images [14]. However, such methods mainly emphasize image enhancement and lightweight detection, while the joint treatment of small thermal targets, blurred boundaries, and complex UAV inspection backgrounds remains insufficient. Xie et al. proposed ST-YOLO based on YOLOv8s for photovoltaic module defect detection using infrared thermal imaging, improving feature extraction through structural modifications [15]. Nevertheless, the repeated downsampling and generic feature-fusion process in YOLO-based detectors may still weaken fine-grained thermal cues that are critical for small hot spots and locally abnormal temperature-rise regions. More recently, Ma et al. developed LFS-YOLO for PV panel defect detection using UAV infrared sensors, considering practical issues such as defect morphology variation, unclear boundary features, and small-target defects [16]. Although these studies have promoted the development of automatic PV thermal defect detection, existing methods still face challenges in simultaneously preserving shallow small-target features, suppressing false alarms caused by repetitive PV array backgrounds and thermal noise, improving localization stability for boundary-ambiguous thermal anomalies, and maintaining model compactness for UAV inspection scenarios. In addition, prompt learning has recently shown potential in remote sensing interpretation tasks, including prompt-guided instance segmentation and change captioning. These studies indicate that prompt-based mechanisms can improve task adaptation and semantic guidance in complex remote sensing scenes. However, their application to UAV-based infrared PV thermal defect detection remains relatively underexplored, especially under lightweight deployment constraints [17,18]. Therefore, it is of considerable theoretical significance and practical value to investigate a task-oriented lightweight detection framework for UAV-based infrared photovoltaic inspection [19,20,21].
To address the above challenges, this paper proposes an improved YOLOv8n-based detection algorithm for thermal defect detection of PV modules in UAV-based infrared inspection scenarios. The proposed method is intended to enhance the feature representation capability for small thermal defect targets, strengthen feature extraction and discrimination under complex backgrounds, and reduce model complexity while maintaining high detection performance. To this end, targeted optimizations are conducted at both the network architecture and loss function levels, thereby achieving a coordinated improvement in detection accuracy and inference efficiency. Experimental results demonstrate that the proposed method effectively improves the detection performance of infrared thermal defects and can provide reliable technical support for the intelligent and automated operation and maintenance of PV power stations.
The main contribution of this work lies in developing a task-specific lightweight detection framework for UAV-based infrared thermal defect inspection of photovoltaic modules. Different from directly applying a generic object detector to infrared PV images, the proposed framework is optimized according to the coupled characteristics of this task, including small-scale thermal defects, weak and diffuse thermal responses, repetitive PV array backgrounds, and deployment constraints on UAV inspection platforms. Specifically, the detection scale configuration, lightweight feature extraction path, contextual feature-fusion module, and bounding-box regression loss are jointly adapted to improve small-target perception, background robustness, localization stability, and model compactness. In addition, an expanded UAV infrared PV thermal defect dataset with representative challenging cases is constructed, and the effectiveness of the proposed framework is validated through ablation studies, comparisons with recent lightweight detectors, stricter mAP@0.5:0.95 evaluation, repeated-run statistics, and side-by-side qualitative comparisons.
The main contributions of this paper are summarized as follows:
(1) A task-specific lightweight detection framework is developed for UAV-based infrared thermal defect inspection of photovoltaic modules. The framework is designed to jointly address small-target thermal defect perception, recognition robustness under repetitive PV array backgrounds, localization stability for boundary-ambiguous anomalies, and lightweight deployment requirements.
(2) A multi-scale detection structure is redesigned for small thermal defects by introducing a P2 shallow feature detection branch and removing the redundant large-target branch under the present UAV inspection setting. This design enhances fine-grained thermal feature perception while reducing unnecessary computational overhead for the dominant defect scales in the constructed dataset.
(3) A lightweight and context-enhanced feature representation strategy is constructed by incorporating Ghost Convolution (GhostConv) into the backbone and C2f-Large Separable Kernel Attention (C2f-LSKA) into the neck feature-fusion stage. This design reduces model complexity while strengthening contextual and spatial feature discrimination under complex infrared backgrounds.
(4) A more rigorous experimental validation protocol is established by expanding the annotated UAV infrared PV thermal defect dataset to 3000 images, covering representative challenging cases such as small hot spots, low-contrast defects, complex background interference, and diffuse abnormal temperature-rise regions. Extensive ablation experiments, comparisons with recent lightweight detectors, mAP@0.5:0.95 evaluation, repeated-run statistics, and side-by-side qualitative results are provided to validate the effectiveness and stability of the proposed method.
2. YOLOv8 Network Architecture
YOLOv8 represents a recent advancement in the YOLO series of object detection algorithms. Building upon the strengths of previous versions, it further improves detection performance and inference speed through a series of structural optimizations and functional enhancements. To accommodate diverse application requirements, YOLOv8 provides five model variants, namely N, S, M, L, and X, according to different scaling coefficients [22]. According to the official Ultralytics implementation, YOLOv8 adopts advanced backbone and neck architectures together with an anchor-free split head. In the YOLOv8n configuration adopted in this study, the backbone is mainly composed of Conv, C2f, and Spatial Pyramid Pooling-Fast (SPPF) modules, the neck performs multi-scale feature fusion through upsampling, concatenation, and C2f blocks, and the head outputs predictions at three scales through the Detect layer [23]. As illustrated in Figure 1, the adopted YOLOv8n baseline progressively generates feature maps from P1 to P5 and performs detection on the P3, P4, and P5 feature levels.
In the backbone stage, the YOLOv8n configuration adopted in this study mainly uses stacked Conv and C2f modules together with an SPPF module for hierarchical feature extraction. Through five successive downsampling operations, the input image is transformed into five feature maps at different scales, denoted as P1 to P5, enabling the network to capture objects of different sizes while maintaining a balance between detection accuracy and computational efficiency. Compared with YOLOv5, YOLOv8 replaces the original C3 module with the C2f structure. Through richer skip connections and feature split operations, the C2f module enhances gradient flow propagation, thereby enabling the model to learn and preserve critical feature information more effectively. In addition, an SPPF module is deployed at the end of the backbone. By sequentially stacking three max-pooling layers, the SPPF module captures multi-scale receptive field information. Compared with the conventional SPP structure, SPPF reduces computational cost and inference latency while maintaining the diversity and richness of feature representation, thereby improving the adaptability and detection accuracy of the model for input features of different scales.
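The relationship between SPPF and SPP described above can be illustrated with a minimal one-dimensional sketch. It assumes a pooling kernel of 5 with stride 1 and same-size padding, which is the setting commonly found in public YOLOv8 configurations but is not stated in this paper; the helper name `max_pool_1d` and the test signal are hypothetical.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with same-size output (out-of-range cells are ignored)."""
    half = kernel // 2
    n = len(values)
    return [max(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

# SPPF style: three identical pooling layers applied one after another.
y1 = max_pool_1d(signal, 5)
y2 = max_pool_1d(y1, 5)
y3 = max_pool_1d(y2, 5)

# SPP style: three parallel pooling layers with growing kernels.
spp_5 = max_pool_1d(signal, 5)
spp_9 = max_pool_1d(signal, 9)
spp_13 = max_pool_1d(signal, 13)

# The cascaded outputs equal the parallel ones, so the stacked layers cover
# windows of 5, 9, and 13 cells while only ever sliding a 5-wide window.
print(y1 == spp_5, y2 == spp_9, y3 == spp_13)  # prints: True True True
```

Under these assumptions, the stacked design reproduces the three receptive-field sizes of the parallel design, which is consistent with the reduced cost attributed to SPPF.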
In the neck stage, YOLOv8 adopts a Path Aggregation Network–Feature Pyramid Network (PAN-FPN) architecture [24] to fuse features from different hierarchical levels. This structure enhances the interaction between shallow detailed features and deep semantic features, optimizes the information flow across feature layers, and consequently improves the detection performance for multi-scale targets.
In the head stage, YOLOv8 employs a decoupled head design, which separates the classification task from the bounding box regression task, thereby improving detection accuracy and training stability. Moreover, YOLOv8 supports multi-scale prediction based on feature maps with downsampling factors of 8, 16, and 32, further enhancing its flexibility and accuracy in detecting targets of different sizes.
3. Improved Network Design
The improved network is designed according to the specific visual and deployment characteristics of UAV-based infrared PV inspection rather than by simply stacking independent modules. The overall design follows four task-driven objectives: preserving fine-grained thermal cues for small defects, reducing redundant computation for lightweight deployment, enhancing contextual discrimination under repetitive PV array backgrounds, and improving localization stability for boundary-ambiguous thermal anomalies. The proposed improvements are integrated into YOLOv8n in a structured manner at different stages of the network. Specifically, a P2 shallow feature layer is introduced into the detection head to enhance small-target thermal defect perception, while the original P5 large-target detection branch is removed as a task-specific structural simplification under the present UAV infrared inspection setting. As a result, the improved detector performs prediction on the P2, P3, and P4 feature levels. GhostConv is introduced into the backbone to reduce redundant computation and improve deployment efficiency. In the neck, some of the original C2f blocks are replaced by the proposed C2f-LSKA blocks to enhance contextual representation under complex backgrounds. In addition, the original bounding-box regression loss is replaced with Wise-IoU version 3 (WIoUv3) during training. The overall integration of these components into the adopted YOLOv8n baseline is summarized in Table 1. In this design, the overall feature-extraction, feature-fusion, and prediction pipeline of YOLOv8n is retained, while the modifications are limited to the task-specific detection scale configuration, lightweight convolution replacement, neck-stage contextual enhancement, and bounding-box regression loss.
3.1. P2-Based Detection Head for Small Thermal Defects
In UAV-based infrared inspection scenarios for PV modules, thermal defect targets are generally manifested only as local high-temperature bright spots or anomalous temperature-rise regions due to variations in flight altitude, imaging perspective, and the projected scale of modules in the captured images. As a result, these targets usually exhibit characteristics such as small size, blurred boundaries, and irregular shapes. In the original YOLOv8 model, the input image is processed through multiple downsampling operations to generate five feature layers at different scales, namely P1, P2, P3, P4, and P5, while three detection heads are mainly constructed based on P3, P4, and P5, corresponding to feature maps of 80 × 80, 40 × 40, and 20 × 20, respectively. However, for small thermal defects in UAV-based infrared inspection images, repeated downsampling tends to cause the loss of shallow positional information and fine-grained thermal features, thereby limiting the capability of the original three-scale detection head structure to adequately perceive and accurately localize such targets.
To address the above issue, a P2-based detection head with a feature resolution of 160 × 160 is further introduced into the original network architecture. This design enables the model to better preserve and exploit the richer spatial positional cues and fine-grained local thermal anomaly details embedded in shallow features. As a result, the proposed network achieves improved representation and discrimination of small-scale hot spots and locally abnormal temperature-rise regions in infrared images, thereby alleviating the deficiency of the original YOLOv8 in small-target thermal defect detection. The architecture of the improved detection head with the P2 branch is illustrated in Figure 2.
The improved detection structure performs prediction on three feature levels, corresponding to P2, P3, and P4 with feature resolutions of 160 × 160, 80 × 80, and 40 × 40, respectively. Compared with the original YOLOv8n detection head based on P3, P4, and P5, the proposed structure introduces the high-resolution P2 branch to strengthen small-target thermal defect perception and removes the original P5 branch to reduce unnecessary computation for the dominant defect scales in the constructed dataset. The 160 × 160 P2 branch is specifically used to preserve fine-grained shallow thermal cues, while the P3 and P4 branches provide complementary feature representations for small- and medium-scale thermal anomalies. It should be noted that not all thermal anomalies in UAV infrared PV inspection are strictly small targets. Diffuse abnormal temperature-rise regions may also appear in some images. However, under the current acquisition condition with an approximate flight altitude of 100 m, most annotated thermal defects are projected as small or medium-scale regions in the resized 640 × 640 images. Therefore, the removal of the original P5 large-target detection branch is adopted as a task-specific structural simplification rather than a universal design choice for all inspection scenarios. The purpose is to allocate more representation capacity to shallow and middle-level features that are more relevant to the dominant defect scales in the constructed dataset, while reducing unnecessary computational overhead. The ablation results further show that this simplification does not impair the overall detection performance under the evaluated dataset, although its applicability to scenarios with a higher proportion of large-area thermal anomalies should be further verified.
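The feature resolutions quoted above follow directly from the stride of each pyramid level. The short sketch below reproduces them for a 640 × 640 input, assuming that level Pn has a stride of 2^n; the function names are hypothetical, and the count covers prediction locations only, not parameters or FLOPs.

```python
def grid_sizes(image_size, levels):
    """Return {level name: grid side length}, assuming level Pn has stride 2**n."""
    return {f"P{lv}": image_size // (2 ** lv) for lv in levels}


def total_cells(sizes):
    """Number of prediction locations summed over all detection levels."""
    return sum(side * side for side in sizes.values())


baseline = grid_sizes(640, levels=(3, 4, 5))  # original YOLOv8n head: P3, P4, P5
improved = grid_sizes(640, levels=(2, 3, 4))  # P2 branch added, P5 branch removed

print(baseline, total_cells(baseline))
print(improved, total_cells(improved))
```

For example, a hot spot of about 16 × 16 pixels spans only 2 × 2 cells on the P3 grid (stride 8) but 4 × 4 cells on the P2 grid (stride 4), which is the motivation given above for the shallow branch.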
3.2. GhostConv Module
GhostConv was proposed by Huawei in 2020 as part of the GhostNet architecture [25]. Its core idea is to generate only a portion of the intrinsic feature maps through standard convolution and then produce additional representative feature maps by using inexpensive linear transformations, thereby reducing model parameters and computational cost while preserving feature representation capability.
As illustrated in Figure 3, the input feature map is first processed by a set of standard convolution kernels to generate part of the output feature maps. Subsequently, inexpensive linear operations are applied to these feature maps to produce another part of the output features. In practice, such inexpensive operations can be implemented by depthwise convolutions or small-kernel convolutions. In this manner, each original feature map can be used to derive multiple additional feature maps. Although the newly generated feature maps differ from the original ones in terms of information representation, they are complementary in the channel dimension and jointly enhance the model’s capability to represent the input data. Finally, the feature maps produced by standard convolution and those generated by inexpensive operations are concatenated along the channel dimension through a Concat module to form the final output feature map. Owing to this design, GhostConv effectively reduces the number of model parameters and computational complexity while maintaining strong feature representation capability, making it particularly suitable for deployment scenarios with limited computational resources, such as mobile and embedded devices. Therefore, the motivation of introducing GhostConv is to improve deployment efficiency by reducing redundant computation while preserving effective feature representation.
Assume that the kernel size of the standard convolution is $k \times k$, the spatial resolution of the input feature map is $H \times W$, and the numbers of input and output channels are $C_{in}$ and $C_{out}$, respectively. Under these definitions, the computational complexity of GhostConv throughout the entire feature generation process can be formulated as
$$\mathrm{FLOPs}_{\mathrm{Ghost}} = \mathrm{FLOPs}_{\mathrm{conv}} + \mathrm{FLOPs}_{\mathrm{cheap}} \tag{3}$$
where $\mathrm{FLOPs}_{\mathrm{conv}}$ denotes the computational cost of generating half of the feature maps by standard convolution in the first stage, and $\mathrm{FLOPs}_{\mathrm{cheap}}$ denotes the computational cost of generating the other half through inexpensive operations in the second stage.
If standard convolution is directly applied to the input feature map to generate output feature maps of the same size as those produced by GhostConv, the computational cost of the entire convolution process can be expressed as follows:
$$\mathrm{FLOPs}_{\mathrm{std}} = H \cdot W \cdot k^{2} \cdot C_{in} \cdot C_{out} \tag{4}$$
Based on Equations (3) and (4), the computational cost ratio of GhostConv to standard convolution can be further derived as follows:
$$\frac{\mathrm{FLOPs}_{\mathrm{Ghost}}}{\mathrm{FLOPs}_{\mathrm{std}}} = \frac{\mathrm{FLOPs}_{\mathrm{conv}} + \mathrm{FLOPs}_{\mathrm{cheap}}}{H \cdot W \cdot k^{2} \cdot C_{in} \cdot C_{out}} \tag{5}$$
When the convolution kernel size is 3 × 3, it can be further observed that the computational cost of GhostConv in feature extraction is only approximately 50–60% of that of standard convolution, indicating a substantial reduction in computational overhead.
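The 50–60% figure can be checked numerically with a small sketch. It assumes that half of the output channels come from a k × k standard convolution and that the other half come from a d × d depthwise convolution with d = 5; the cheap-operation kernel size is not given in the text and is an assumption here, as are the function names and the example feature-map size.

```python
def standard_conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulates of a k x k convolution producing an h x w x c_out map."""
    return h * w * k * k * c_in * c_out


def ghost_conv_flops(h, w, k, c_in, c_out, d=5):
    """Cost of a Ghost convolution under the assumptions stated in the lead-in.

    Stage 1: k x k standard convolution producing c_out / 2 intrinsic maps.
    Stage 2: d x d depthwise convolution (assumed cheap operation) producing
             the remaining c_out / 2 maps, one per intrinsic map.
    """
    half = c_out // 2
    primary = h * w * k * k * c_in * half
    cheap = h * w * d * d * half
    return primary + cheap


def ghost_ratio(k, c_in, d=5):
    """Closed form of ghost / standard cost: 1/2 + d^2 / (2 k^2 c_in)."""
    return 0.5 + (d * d) / (2 * k * k * c_in)


for c_in in (16, 32, 64, 128):
    c_out = 2 * c_in
    ratio = ghost_conv_flops(80, 80, 3, c_in, c_out) / standard_conv_flops(80, 80, 3, c_in, c_out)
    print(c_in, round(ratio, 3))
```

Under these assumptions the ratio stays between roughly 0.51 and 0.59 for the channel widths listed, and it approaches 0.5 as the number of input channels grows because the depthwise stage becomes negligible.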
In the adopted implementation, GhostConv is used to replace the corresponding standard convolution operations in the lightweight backbone path, while keeping the overall feature extraction pipeline of YOLOv8n unchanged. This design reduces computational burden without altering the basic multi-scale detection framework.
3.3. C2f-LSKA Module
During UAV-based infrared inspection of PV modules, the detection of thermal defect targets is highly susceptible to complex background interference, including ground textures, shadow effects, repetitive array structures of PV modules, and infrared thermal noise. In addition, the task is further challenged by significant target-scale variations and blurred boundaries of thermal spots. To address these issues, a Large Separable Kernel Attention (LSKA) module is introduced in this study. By leveraging separable large-kernel convolutions, the LSKA module enhances the modeling of correlations among different regions of the feature map, thereby improving the representation capability of thermal defect targets in complex scenes while controlling the number of parameters and computational complexity.
Specifically, the LSKA module decomposes a large-kernel convolution into three components, namely depthwise convolution (DW-Conv), depthwise dilated convolution (DW-D-Conv), and channel convolution (1 × 1 convolution). Subsequently, the two-dimensional kernels in the depthwise convolution and depthwise dilated convolution are further factorized into two one-dimensional kernels, where an N × 1 kernel is used to perform convolution along the vertical direction and a 1 × N kernel is used to perform convolution along the horizontal direction. Through this decomposition strategy, the module achieves an effect approximately equivalent to that of large-kernel convolution while reducing parameter count and computational overhead, as illustrated in Figure 4. This design also enables the model to capture richer long-range spatial dependency information from the input feature maps, thereby enhancing the recognition capability for thermal defect targets under complex backgrounds.
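The parameter saving of this decomposition can be made concrete with a rough count of depthwise weights per channel. The sketch assumes the commonly described large-kernel decomposition, in which a K × K kernel with dilation d is approximated by a (2d − 1) depthwise kernel followed by a ⌈K/d⌉ dilated depthwise kernel; the values K = 23 and d = 3 are illustrative only and are not taken from this paper.

```python
import math


def lska_kernel_sizes(K, d):
    """Kernel lengths used to approximate a K x K kernel with dilation d.

    Returns the length of the depthwise kernel (2d - 1) and of the
    dilated depthwise kernel ceil(K / d).
    """
    return 2 * d - 1, math.ceil(K / d)


def weights_per_channel(K, d):
    """Depthwise weights per channel (biases and the 1 x 1 convolution excluded)."""
    k_dw, k_dwd = lska_kernel_sizes(K, d)
    return {
        "full_2d": K * K,                         # plain K x K depthwise kernel
        "decomposed_2d": k_dw ** 2 + k_dwd ** 2,  # DW-Conv + DW-D-Conv
        "separable_1d": 2 * k_dw + 2 * k_dwd,     # each split into N x 1 and 1 x N
    }


print(weights_per_channel(K=23, d=3))
```

With these illustrative values, the per-channel weight count falls from 529 for the full kernel to 89 after the two-stage decomposition and to 26 after the additional one-dimensional factorization.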
To achieve an effective integration of the C2f structure and the LSKA module, an LSKA structure is designed in this study, as illustrated in Figure 5. Specifically, the input feature map is first normalized by batch normalization, followed by nonlinear activation using the GELU function, and is then fed into the LSKA module for feature enhancement. The resulting features are finally fused with the original input feature map through a residual connection. Notably, two 1 × 1 convolution layers are introduced at the two ends of the LSKA module. The first one is used to reduce the channel dimension of the feature map, while the second one restores the channel dimension to match that of the input. This design reduces the number of model parameters and computational overhead while further enhancing the feature representation capability.
In this study, the Bottleneck units in the original C2f module are replaced with the proposed LSKA structure, thereby forming a new C2f-LSKA module, as illustrated in Figure 6. The original C2f module mainly focuses on capturing local features from the input data. However, its limitation lies in its emphasis on modeling only small local regions, making it difficult to establish sufficient correlations among different regions of the feature map. As a consequence, its feature representation capability and robustness may be constrained in complex visual scenarios. In contrast, the proposed C2f-LSKA module integrates the C2f structure with the LSKA mechanism, which significantly enhances the model’s ability to capture contextual semantic information, long-range dependencies, and spatial structural characteristics. As a result, the proposed module improves both the recognition robustness and the feature representation capability for infrared thermal defect targets under complex backgrounds. Thus, the proposed C2f-LSKA module is motivated by the need to enhance contextual modeling and robustness in complex UAV infrared inspection backgrounds.
In the adopted YOLOv8n architecture, the proposed C2f-LSKA blocks are deployed in the neck feature-fusion stage by partially replacing the original C2f modules. This partial replacement strategy is adopted to balance contextual feature enhancement and lightweight deployment. Specifically, C2f-LSKA is introduced into selected neck positions to strengthen long-range dependency modeling and background discrimination, while the remaining original C2f modules are retained to preserve efficient local feature extraction and feature reuse. In contrast, full replacement may introduce additional computational burden without necessarily providing a better accuracy-efficiency trade-off. Therefore, partial replacement is adopted in the final model, and its effectiveness is further verified by the comparison with full neck replacement in the ablation study.
3.4. Loss Function Improvement
A well-designed loss function plays a crucial role in improving the overall performance of an object detection model. In the original YOLOv8 framework, bounding box regression is optimized by a combination of Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU) loss. Among them, CIoU constrains the bounding box prediction process by jointly considering the center-point distance between the predicted box and the ground-truth box, the overlap region, and the aspect ratio discrepancy. However, in the task of small thermal defect detection from UAV-based infrared inspection images, CIoU still exhibits several limitations.
First, thermal defects are typically manifested only as localized high-temperature anomalous regions with small target scales, making slight variations in IoU insufficiently sensitive to the actual localization error. Second, thermal spots often exhibit irregular shapes and blurred boundaries, which limits the adaptability of CIoU to such geometric variations. Third, the balancing term in CIoU is formulated in a fixed manner, making it difficult to dynamically adapt the optimization emphasis to thermal defect samples of different scales and quality levels. As a result, its effectiveness in complex scenarios may be constrained. The corresponding formulations are given in Equations (6)–(8):
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b_{gt})}{c^{2}} + \alpha v \tag{6}$$
$$\alpha = \frac{v}{(1 - IoU) + v} \tag{7}$$
$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h} \right)^{2} \tag{8}$$
The definitions of the involved parameters are illustrated in Figure 7. Specifically, $IoU$ denotes the intersection over union between the predicted box and the ground-truth box; $\rho(b, b_{gt})$ represents the Euclidean distance between the center points of the predicted box $b$ and the ground-truth box $b_{gt}$, and $c$ denotes the diagonal length of the minimum enclosing box covering both boxes; $\alpha$ is a balancing factor used to weight the penalty term for shape mismatch; $v$ is used to measure the similarity in aspect ratio between the predicted box and the ground-truth box; $h$ and $w$ denote the height and width of the predicted box, respectively; $h_{gt}$ and $w_{gt}$ denote the height and width of the ground-truth box, respectively.
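The CIoU terms defined above can be evaluated with a short scalar sketch. It follows the standard published CIoU formulation and assumes boxes given as (center x, center y, width, height); the function name, the box format, and the small constant `eps` are illustrative choices rather than details taken from this paper.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h)."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt

    # Corner coordinates of both boxes.
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    xg1, yg1, xg2, yg2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2

    # Intersection over union.
    inter_w = max(0.0, min(x2, xg2) - max(x1, xg1))
    inter_h = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = inter_w * inter_h
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared center distance and squared diagonal of the enclosing box.
    c2 = (max(x2, xg2) - min(x1, xg1)) ** 2 + (max(y2, yg2) - min(y1, yg1)) ** 2
    rho2 = (x - xg) ** 2 + (y - yg) ** 2

    # Aspect-ratio term v and its balancing factor alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / (c2 + eps) + alpha * v


small = ciou_loss((10, 10, 4, 4), (11, 10, 4, 4))          # 4 x 4 box shifted by one pixel
large = ciou_loss((100, 100, 40, 40), (101, 100, 40, 40))  # 40 x 40 box shifted by one pixel
print(round(small, 3), round(large, 3))
```

In this example the same one-pixel offset yields a loss of about 0.42 for the 4 × 4 box and about 0.05 for the 40 × 40 box, which illustrates how strongly the overlap-based terms depend on target scale.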
In this study, WIoUv3 is adopted as the bounding box regression loss [26]. Compared with CIoU, WIoUv3 first introduces a dynamic non-monotonic focusing mechanism, which enables the model to pay greater attention to localization quality during training and thus improves its sensitivity to small targets. In addition, WIoUv3 appropriately reduces the emphasis on low-quality anchor boxes in the later stage of training, thereby alleviating the influence of harmful gradients on parameter updates and improving both training efficiency and optimization stability. Furthermore, WIoUv3 places greater emphasis on the accuracy of center position regression, which is beneficial for improving the localization precision of small-scale thermal defect targets.
For the UAV-based infrared photovoltaic inspection task investigated in this study, these characteristics of WIoUv3 enable the model to more accurately regress the locations and boundaries of small-scale thermal defect targets while reducing the interference of low-quality samples during training. As a result, the proposed method further improves localization accuracy, training stability, and generalization performance in infrared thermal defect detection. The corresponding formulations are given in Equations (9)–(11).
In these equations, $\beta$ denotes the outlier degree of the anchor box, with a smaller value indicating a higher-quality anchor box; $\gamma$ is the non-monotonic focusing coefficient used to suppress the interference of low-quality anchor boxes during training; $\delta$ is a hyperparameter introduced to regulate the effect of $\beta$; $(x, y)$ and $(x_{gt}, y_{gt})$ denote the center coordinates of the predicted box and the ground-truth box, respectively; $W_g$ and $H_g$ represent the width and height of the minimum enclosing box covering both the predicted box and the ground-truth box, respectively; $S_u$ denotes the area of the union of the predicted box and the ground-truth box; $L_{IoU}^{*}$ denotes the IoU loss of the current anchor box, and $\overline{L_{IoU}}$ denotes the mean IoU loss.
For the infrared thermal defect detection task, the dynamic non-monotonic focusing mechanism of WIoUv3 is beneficial for handling small targets with blurred boundaries. In UAV infrared PV images, weak hot spots and diffuse abnormal temperature-rise regions often exhibit fuzzy thermal transitions, and extremely low-quality predicted boxes may introduce unstable regression gradients. WIoUv3 uses the outlier degree β to adaptively adjust the regression weight in a non-monotonic manner, so that ordinary-quality samples with useful localization information receive more attention, whereas extremely low-quality samples caused by ambiguous boundaries or thermal noise are relatively suppressed. This helps improve localization stability for boundary-ambiguous thermal defects. In this study, the focusing parameters were set as γ = 1.9 and δ = 3.0, where γ controls the strength of the non-monotonic focusing effect and δ regulates the influence of β on the regression weight. This setting provides a moderate focusing effect and was kept fixed in all experiments to ensure fair comparison among different ablation variants. During training, WIoUv3 is incorporated by replacing the original CIoU-based bounding-box regression loss in YOLOv8, while the classification-related loss, distribution focal loss, optimizer, training schedule, and other settings remain consistent with the adopted YOLOv8n baseline unless otherwise specified.
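The non-monotonic behavior described above can be illustrated with a scalar sketch of the publicly described WIoU v3 formulation. It assumes that γ = 1.9 acts as the base of the exponential term and δ = 3.0 as its offset, so that the gain is β / (δ · γ^(β − δ)); this mapping of the symbols is an assumption, the function names are hypothetical, and the gradient detachment used in actual training code is not modeled here.

```python
import math


def focusing_coefficient(beta, gamma=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * gamma ** (beta - delta))."""
    return beta / (delta * gamma ** (beta - delta))


def wiou_v3(iou, center_dist2, enclose_w, enclose_h, mean_iou_loss,
            gamma=1.9, delta=3.0):
    """WIoU v3 value for one predicted box (forward computation only)."""
    iou_loss = 1.0 - iou
    # Distance attention: grows as the two box centers move apart.
    r_wiou = math.exp(center_dist2 / (enclose_w ** 2 + enclose_h ** 2))
    # Outlier degree: this box's IoU loss relative to the running mean.
    beta = iou_loss / mean_iou_loss
    return focusing_coefficient(beta, gamma, delta) * r_wiou * iou_loss


peak = 1.0 / math.log(1.9)  # outlier degree at which the gain is largest
for beta in (0.25, 1.0, peak, 3.0, 6.0):
    print(round(beta, 3), round(focusing_coefficient(beta), 3))
```

Under these assumptions the gain rises for small outlier degrees, peaks near β ≈ 1.56, equals 1 at β = δ, and then decays, so both very easy and very poor predictions are down-weighted relative to ordinary-quality ones.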
Overall, WIoUv3 is adopted to provide more stable localization learning for small thermal defects with ambiguous boundaries.
In summary, targeted improvements are introduced into the backbone, neck, and head of the YOLOv8 network in this study, resulting in an enhanced model for thermal defect detection in UAV-based infrared inspection scenarios. The overall architecture of the improved network is illustrated in Figure 8.
5. Conclusions
This paper proposes an improved YOLOv8n-based target detection algorithm for surface thermal defects of PV modules. First, a new extra-small target detection layer is introduced into the original three-scale detection framework, while the largest-scale detection layer is removed, thereby improving the detection accuracy of small-scale thermal defect targets while reducing model complexity. Second, a lightweight GhostConv module is incorporated to further reduce the number of model parameters and compress the model size. Subsequently, a new C2f-LSKA module is designed in the neck network to replace part of the original C2f modules, thereby enhancing the model’s capability to capture contextual semantic information and model spatial structural features. Finally, WIoUv3 is adopted as the bounding-box regression loss to improve regression stability and localization accuracy during training. Experimental results on the expanded self-constructed UAV-based infrared thermography dataset show that, compared with the baseline YOLOv8n model, the proposed method improves Precision, Recall, mAP@0.5, and mAP@0.5:0.95 by 5.1, 11.4, 9.6, and 13.2 percentage points, respectively, while reducing the number of parameters and model size by 65.8% and 61.9%, respectively. In the comparative experiments, the proposed method achieves competitive mAP@0.5 and mAP@0.5:0.95 among the compared detectors, reaching 93.9 ± 0.2% and 72.3 ± 0.3%, respectively. These results indicate improved detection accuracy and localization quality under the evaluated UAV infrared inspection setting while maintaining lightweight characteristics.
Several limitations should be acknowledged. The current experiments were conducted on a self-constructed UAV infrared dataset acquired using a single UAV platform and infrared camera at a relatively consistent flight altitude. Thus, the reported results mainly reflect the effectiveness of the proposed method under the evaluated inspection configuration, while its cross-platform, cross-sensor, cross-altitude, and cross-scene generalization still requires further validation. In addition, the runtime evaluation was performed on a GPU-based experimental platform rather than a dedicated onboard edge device. Future work will focus on multi-site data collection, compatible public-dataset validation, edge-platform deployment tests, and robustness improvement under weak low-contrast defects, diffuse abnormal temperature-rise regions, and complex thermal background interference. In addition, prompt learning and foundation-model-based remote sensing interpretation will be further explored to incorporate defect-location priors, thermal anomaly descriptions, and expert guidance for improving cross-scene adaptation in UAV infrared PV inspection.