1. Introduction
Despite the numerous environmental problems caused by the use of fossil fuels, global demand for energy continues to grow. This demand is driven especially by the rapid economic growth of developing countries, resulting in increased greenhouse gas emissions and exacerbated pollution problems [1]. Solar energy, as a clean and sustainable energy source, has been expanding its scope of application and is gradually becoming one of the core strategies to mitigate climate change and achieve carbon neutrality. Photovoltaic (PV) power generation, the primary method of harnessing solar energy, is advancing rapidly. Solar panels are the core component of a photovoltaic power system, and their performance and quality directly determine the efficiency of the system [2].
However, increased reliance on solar energy also brings technical challenges, particularly with regard to the reliability of photovoltaic components. In practical applications, solar panels often exhibit various defects due to several factors. Material limitations, mechanical and thermal deviations during processing, and prolonged operation contribute to defects such as cracks, fingers, black cores, and thick lines [3]. These defects can reduce energy conversion efficiency, lead to power degradation [4,5], and even cause the failure of entire modules, thereby seriously affecting the long-term operation and economic performance of PV systems and increasing maintenance costs. Efficient and accurate defect detection has therefore become crucial to ensuring solar panel quality and enhancing PV system reliability.
In order to address the above challenges, researchers have explored various optimization strategies for solar energy systems, including material improvements and advanced monitoring technologies. Other areas of solar energy technology have accumulated valuable optimization experiences. For example, parabolic trough solar collectors (PTSC) have improved system robustness by adopting material optimization measures such as nanofluids and selective coatings [
6], providing a reference for preventive maintenance of photovoltaic systems. Intelligent image recognition technology has also brought new opportunities for improving defect detection efficiency. Image processing methods based on deep learning, such as the staged reconstruction technique of generative adversarial networks (GAN), have demonstrated significant advantages in extracting micro-features [
7]. These methods can further optimize the accuracy of crack recognition.
These challenges are further exacerbated in distributed photovoltaic scenarios, where system constraints and environmental complexity require higher detection accuracy. Green building and low-carbon city concepts have promoted the increased deployment of distributed photovoltaic (PV) systems. These systems are commonly installed in built environments such as industrial parks, commercial rooftops, and residential buildings. However, the available installation area in such environments is limited, and the system structures are typically more compact. This imposes higher requirements on PV module quality, power generation efficiency, and fault detection capabilities [
8,
9,
10]. The performance of PV power generation is closely linked to the quality of solar panels [
2,
3], and even minor defects can negatively affect output power and operational stability. Furthermore, solar panels operate outdoors for extended periods under harsh conditions such as high temperatures, dust, humidity, and corrosive environments. Without timely maintenance, they are susceptible to cracking, hot spots, and other forms of functional degradation [
11]. These factors reduce the operational lifespan of the modules and impair overall system performance. Therefore, developing a reliable and efficient method for defect detection is critical to ensuring the stable operation of centralized and distributed PV systems.
Currently, solar panel defect detection methods mainly include manual, physical, and automated inspection techniques based on machine vision [
12]. Among these, manual visual inspection suffers from high subjectivity and low efficiency. Physical inspection methods typically employ sensors such as ultrasonic, laser, and electrical probes to collect surface condition data from solar panels [
13,
14]. However, these methods can be intrusive and may potentially damage the panels. In contrast, machine vision approaches—based on image processing and feature extraction—offer enhanced objectivity and automation. For example, Tsai et al. [
15] proposed an anisotropic diffusion-based detection method that effectively segments microcracks using image enhancement and morphological operations. Kang et al. [
16] employed a Kalman filtering algorithm to analyze electrical parameters and identify abnormal power declines, although their method could not localize specific faults. Al-Waisy et al. [
17] developed a hybrid classification system by combining Inception-V3 and ResNet50 models for defect classification. Venkatesh et al. [
18] proposed a fault identification method using aerial imagery. Their approach integrated CNN-based feature extraction with a decision tree classifier to achieve fault detection.
In recent years, deep learning-based methods have become the dominant paradigm for solar panel defect detection due to their superior feature extraction capabilities. For instance, Cao et al. [
19] proposed the YOLOv5s-GBC algorithm to enhance detection accuracy and later developed YOLOv8-gd [
20], which achieves a lightweight design through depth-wise separable convolution and a BiFPN structure. Huang et al. [
12] introduced YOLOv5-BDL, integrating an improved LCA attention mechanism, while Zhang et al. [
21] incorporated a deformable convolutional CSP module and an ECA attention mechanism into YOLOv5. Notably, Rohith et al. [
22] developed SparkNet with an innovative Fire Modules architecture, which achieved 95% detection accuracy for surface contaminants through its squeeze-expand operations. In the field of hot spot detection, Khang et al. [
23] used the RetinaNet framework to process thermal imaging data and achieve automatic identification of hot spots in photovoltaic modules. Bassil et al. [
24] proposed a hybrid detection framework that combines the feature extraction capabilities of EfficientNetB7 and visual Transformers to achieve an accuracy rate of 97% in photovoltaic panel dust recognition. Karakan et al. [
25] built a multi-architecture comparison system in which SqueezeNet achieved the best accuracy for monocrystalline (97.82%) and polycrystalline (96.29%) cells in EL image defect classification. Similarly, Li et al. [
26] proposed GBH-YOLOv5, which enhances detection performance for small objects. In addition to convolutional neural networks, transformer architectures have shown promising potential in PV applications. Dwivedi et al. [
27] were the first to apply Vision Transformer (ViT) models to solar panel detection. Zhuang et al. [
28] proposed a multi-component attention convolution method that improves feature extraction. To address the issue of data imbalance, Jiang et al. [
29] utilized generative adversarial networks (GANs) for data augmentation. Tang et al. [
30] enhanced the recognition of linear defects by integrating the Hessian matrix into the convolutional framework. Moreover, Zhang et al. [
31] proposed a saliency-guided neural network to improve the segmentation of electroluminescence (EL) images.
Beyond single-modality approaches, multimodal fusion techniques have also gained traction as a way to enhance robustness and fault detection accuracy. Di Tommaso et al. [
10] developed an automated multi-stage model based on the YOLOv3 network and computer vision techniques to process thermal and visible images for detecting various defects. Lei et al. [
32] introduced a Deeplab-YOLO approach that combines Deeplabv3+ for image segmentation with YOLOv5 for defect detection, specifically targeting hot spots on infrared images of photovoltaic panels. Zhao et al. [
33] proposed the PV-UNet model, which demonstrates strong defect localization capabilities in remote sensing images. Furthermore, with the widespread adoption of UAV technology, UAV-based PV inspection systems have emerged as a research hotspot [
34,
35], enabling effective coverage of large power station areas and the identification of localized, small-scale defects. These advancements have significantly improved the precision, reliability, and practicality of photovoltaic fault detection.
In existing research on defect detection in solar panels, despite significant progress, there are still limitations in the coverage of defect types. For example, Cao et al. [
19] primarily focused on two basic defect types. The hybrid detection framework proposed by Bassil et al. [
24] only determines whether dusty cracks and spots are present or absent; YOLOv8-gd [20] covers seven defect types: black core sheets, black spot sheets, short circuit black sheets, over-soldered sheets, broken grids, bright and dark sheets, and hidden cracks. However, it does not cover critical geometric defects such as star cracks, which often severely impact the structural integrity of the components. Zhang et al. [
21] and Jiang et al. [
29] focused on limited defect types such as cracks, fingerprints, and scratches. Li et al. [
26] examined five types of cell-level anomalies, including damaged cells and cells with obvious bright spots. However, they did not cover critical electrical safety hazards like short circuits.
In contrast, the DCE-YOLO model proposed in this study achieves more comprehensive and detailed defect coverage. It covers seven typical and representative defect types, including cracks, finger breaks, black cores, thick lines, star cracks, horizontal dislocations, and short circuits. Among these, star cracks and horizontal dislocations, as key geometric defects, have not been adequately addressed in previous studies. Meanwhile, short circuits, as severe electrical faults, remain a blind spot in many existing detection methods. Such defects not only directly impact the performance and lifespan of photovoltaic modules but also pose risks to operational safety and maintenance costs.
Therefore, DCE-YOLO achieves significant breakthroughs in defect type diversity and practical application value, providing a robust technical foundation for comprehensive detection and precise maintenance of photovoltaic modules. The main contributions of this paper are as follows:
This method addresses YOLOv8’s shortcomings in managing complex backgrounds and long-range dependencies by integrating the Contextual Transformer (COT) attention mechanism to enhance spatial contextual relationships. It also improves symmetry-aware feature representation by fusing static local and dynamic global context cues, thereby improving the model’s robustness in complex environments and its ability to capture structured defect patterns.
The backbone network integrates the C2f-DWR-DRB module, which combines a scalable receptive field design with a standard convolution structure. Its parallel Dilated Reparam Block (DRB) branch employs symmetric rate-configured dilated convolutions, enabling better modeling of the spatial symmetry commonly found in solar panel defect patterns. This integration improves multi-scale feature extraction while reducing computational complexity, achieving a better balance between accuracy and efficiency.
The detection head of YOLOv8 is improved, and a lightweight detection head Detect-Efficient is designed, which optimizes the utilization of computing resources and improves the detection accuracy and speed of the model. This makes the model more suitable for real-time deployment on edge devices.
The Wise-IoU (WIoU) loss function is used instead of the original loss function to solve the problem of uneven sample quality, enhancing the model’s robustness to noisy or low-quality annotations and improving convergence stability during training.
3. Methodology
This paper proposes a new architecture, DCE-YOLO, integrating C2f-DWR-DRB, Contextual Transformer (COT) Attention, and Detect-Efficient modules into the YOLOv8 framework. Specifically, a COT attention mechanism is added after the SPPF module in the backbone to enhance contextual relationships across different spatial locations in the feature map. The DWR-DRB module replaces the original Bottleneck structure within the C2f module to enable more effective extraction of multi-scale contextual information. A lightweight detection head is introduced to improve both detection speed and accuracy. Furthermore, Wise-IoU (WIoU) is adopted in place of the traditional CIoU loss function to enhance overall detection performance. The structure of DCE-YOLO is illustrated in
Figure 2.
3.1. C2f-DWR-DRB
In the task of solar panel defect detection, YOLOv8 exhibits certain limitations in processing multi-scale features. This is especially evident when dealing with symmetric or repetitive structural patterns commonly observed in defects such as grid-like cracks or fingerprint traces. Traditional convolutional neural networks often rely on deeper hierarchies or more complex structures to address multi-scale variations, which can compromise the real-time performance of the model. To enhance YOLOv8’s capability in handling multi-scale features, this study proposes replacing the Bottleneck module in the original C2f structure with a DWR-DRB module, forming a new C2f-DWR-DRB module.
By integrating the DWR and DRB structures, the proposed module can efficiently extract contextual information across multiple scales, thereby enhancing the backbone network’s feature extraction capability. Embedding the C2f-DWR-DRB module within the backbone minimizes computational complexity while improving detection performance. The original C2f module structure in YOLOv8 is illustrated in
Figure 3a, while the modified C2f-DWR-DRB structure proposed in this study is shown in
Figure 3b.
3.1.1. DWR
The DWR (Dilation-wise Residual) [
45] module adopts a residual connection architecture, as shown in
Figure 4. This module divides the feature extraction process into two stages, regional residual and semantic residual, which efficiently capture multi-scale contextual information and integrate the feature maps generated by multi-scale perception. In the first stage, the input features are processed by a 3 × 3 convolution layer followed by batch normalization (BN) and a ReLU activation function. This generates a series of feature maps of varying sizes, referred to as region activations. This stage primarily extracts local features, laying the foundation for subsequent multi-scale processing. In the second stage, three 3 × 3 depth-wise convolution (DConv) layers with different dilation rates are applied, which keeps the convolution operation simple along the depth (channel) direction. This multi-dilation rate design constructs receptive fields of different scales. Small dilation rates focus on local detail features, medium dilation rates capture medium-range context, and large dilation rates model more global spatial relationships. Through this multi-scale feature fusion strategy, the module filters and retains important information at different scales, a process called semantic reconstruction. Subsequently, a 1 × 1 convolution layer maps the concatenated high-dimensional features to a low-dimensional space, thereby reducing model complexity. Finally, residual connections add the dimension-reduced features to the original input features, achieving feature reuse.
It should be noted that this paper optimizes and adjusts the convolution structure of the DWR module while retaining its core structure and multi-scale processing mechanism. Specifically, we replace the multi-scale depth separable convolution in the original design with standard convolution and the DRB module. This modification reduces computational complexity and improves computational efficiency. The improved structure, together with the DRB module, constitutes the DWR-DRB module. The detailed design will be discussed in
Section 3.1.3.
3.1.2. DRB
The DRB (Dilated Reparam Block) [
46] module aims to enhance a large-kernel convolutional layer by running a non-dilated large-kernel convolution in parallel with several dilated small-kernel convolutional layers. Specifically, features are extracted from the input feature map using a non-dilated large-kernel convolution and multiple dilated small-kernel convolutions. Each convolutional layer learns a different representation of the features, and the outputs are summed after their respective batch normalization (BN) layers. When the DRB module is converted into a single large-kernel convolutional layer for inference, each parallel branch must be merged into the large-kernel convolution. A convolutional layer with a dilation rate of r > 1 is converted into an equivalent non-dilated layer by a specific function, with an appropriate amount of zero padding on both sides (ignoring input pixels in a dilated convolution is equivalent to inserting additional zero entries into the convolution kernel). If the size of the original convolution kernel is k, the convolution kernel after inserting zeros is called the equivalent convolution kernel, which has the size (k − 1)r + 1.
Based on this improvement, this paper designs two DRB modules to improve the DWR module. The first structure employs a double-parallel-layer configuration, as illustrated in
Figure 5a. This structure uses dilation rates of r = (1, 2) and a convolution kernel size of k = (3, 3) to simulate a large kernel convolution with an effective size of K = 5. The second is a three-parallel-layer structure, shown in
Figure 5b, which uses a dilation rate of r = (1, 2, 3) and a kernel size of k = (5, 3, 3) to enhance a large kernel convolution with K = 7.
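The zero-insertion equivalence behind the DRB can be checked with a short sketch. It assumes the standard relation K = (k − 1)r + 1 between a kernel of size k with dilation rate r and its non-dilated equivalent. The helper names (`equivalent_size`, `dilate_kernel`, `pad_to`, `merge_branches`) are hypothetical, plain Python lists stand in for tensors, and the folding of BN layers is omitted, so this illustrates the idea rather than the implementation used in this paper.

```python
def equivalent_size(k, r):
    """Size of the non-dilated kernel equivalent to a size-k kernel with dilation r."""
    return (k - 1) * r + 1


def dilate_kernel(kernel, r):
    """Insert r - 1 zeros between neighbouring weights of a square kernel."""
    k = len(kernel)
    size = equivalent_size(k, r)
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i * r][j * r] = kernel[i][j]
    return out


def pad_to(kernel, size):
    """Zero-pad a square kernel symmetrically up to size x size."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[off + i][off + j] = kernel[i][j]
    return out


def merge_branches(large_kernel, branches):
    """Add dilated small-kernel branches into a non-dilated large kernel."""
    size = len(large_kernel)
    merged = [row[:] for row in large_kernel]
    for kernel, r in branches:
        sparse = pad_to(dilate_kernel(kernel, r), size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += sparse[i][j]
    return merged


ones3 = [[1.0] * 3 for _ in range(3)]
large5 = [[0.0] * 5 for _ in range(5)]
# Two-branch setting from the text: k = (3, 3), r = (1, 2), K = 5.
merged5 = merge_branches(large5, [(ones3, 1), (ones3, 2)])
# Three-branch setting from the text: k = (5, 3, 3), r = (1, 2, 3), K = 7.
sizes_k7 = [equivalent_size(k, r) for k, r in zip((5, 3, 3), (1, 2, 3))]
```

In the K = 7 setting, every branch has an equivalent size of at most 7, which is why all of them can be zero-padded and summed into a single 7 × 7 kernel at inference time.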
3.1.3. DWR-DRB
To improve the efficiency and multi-scale contextual modeling capabilities of the DWR module in solar cell defect detection, this paper proposes the DWR-DRB structure, as shown in
Figure 6. This structure first replaces the multi-scale depth-separable convolutions in the DWR module with standard 3 × 3 convolutions to reduce computational complexity. Two dilated reparameterized blocks (DRB) are then introduced to construct equivalent receptive field sizes of K = 5 and K = 7, enhancing the module’s ability to semantically model defects of different scales and structures. The DWR module employs a two-stage structure of “regional residuals” and “semantic residuals.” The module first uses shallow convolutions to capture local activation information. It combines convolutional branches with different dilation rates to integrate deep semantic context, effectively enhancing the multi-level expression of spatial structure. The DRB module further employs a parallel convolutional structure with multiple dilation rates r = 1, 2 or r = 1, 2, 3 to concurrently model local and remote feature dependencies across multiple receptive field scales, thereby establishing cross-scale contextual feature fusion capabilities. The DRB structure with K = 5 provides fine-grained perception capabilities for local defects such as finger, black_core, and short_circuit, while the structure with K = 7 is suitable for capturing larger-scale structural defects such as crack, star_crack, and horizontal_dislocation, enhancing the model’s structured understanding of overall defect morphology. Meanwhile, the symmetric receptive fields constructed by DRB facilitate the explicit modeling of common defect structures such as bilateral symmetry, radial symmetry, and periodic distribution. This design improves the model’s response to repetitive textures and edge microstructures. The DWR and DRB structures provide context-aware and structural modeling capabilities at different scales, respectively. Their combination achieves more comprehensive semantic fusion and geometric symmetry perception, while complementing the COT structure discussed later to collectively enhance model performance.
3.2. COT Attention
Traditional Convolutional Neural Networks (CNNs) exhibit limitations in capturing long-distance contextual information. These networks struggle to effectively extract and integrate semantic information from distant positions. Consequently, YOLOv8 demonstrates suboptimal performance when handling complex contexts and long-distance dependencies. Thus, this paper introduces a Contextual Transformer (COT) attention mechanism, the structure of which is shown in
Figure 7.
The COT attention mechanism module enhances feature representation by fusing static and dynamic contexts. The module first performs a k × k convolution operation on the input feature map to extract local static context information, denoted as K1. The static context K1 is then concatenated with the query vector Q. The dynamic multi-head attention weight matrix is generated by two 1 × 1 convolutional layers. Unlike the traditional self-attention mechanism, this method is based on a local correlation matrix at each position. This method reinforces attention learning through the interaction between queries and keys while integrating static context. Subsequently, the dynamic attention weight matrix is multiplied by the value vector V, processed via a 1 × 1 convolution, to obtain the global dynamic context information denoted as K2. Ultimately, the model’s ability to capture spatial relationships in the feature map is significantly enhanced by fusing the static context K1 and the global context K2, enabling effective modeling of dynamic dependencies among the input features. The fusion of local static and global dynamic contexts allows the model to better capture spatial symmetries that frequently appear in solar panel defect patterns, such as bilateral or radial symmetry, thereby improving symmetry-aware contextual representation.
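The data flow described above can be summarized in equation form. This is a sketch that follows the notation commonly used for the Contextual Transformer block, not necessarily the exact implementation in this paper. The symbols X (input feature map), W_v, W_θ, and W_δ (weights of the 1 × 1 convolutions), ⊛ (local matrix multiplication), and Fuse(·) (the final fusion step) are assumptions introduced here for illustration.

```latex
\begin{aligned}
Q &= X, \qquad V = X W_v, \\
K^{1} &= \mathrm{Conv}_{k \times k}(X), \\
A &= [K^{1}, Q]\, W_{\theta} W_{\delta}, \\
K^{2} &= V \circledast A, \\
Y &= \mathrm{Fuse}\left(K^{1}, K^{2}\right).
\end{aligned}
```

Here K^1 is the static context, A is the dynamic attention matrix obtained from the concatenation [K^1, Q], and K^2 is the dynamic context that is fused with K^1 to form the output Y.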
The COT attention mechanism focuses on enhancing the contextual relationships between different locations in the feature map. It captures relationships between features at the local scale and models long-distance dependencies at the global scale. The C2f-DWR-DRB module enhances the model’s ability to precisely detect and locate flaws of various sizes by effectively aggregating multi-scale contextual information. Global and local interactions are combined to enhance the expressive power of feature representations, enabling a richer portrayal of contextual information in intricate visual scenes. This improves the model’s accuracy in flaw detection.
3.3. Detect-Efficient
YOLOv8 performs well in target detection tasks due to its efficient network structure and refined detection head design. The YOLOv8 detection head consists of two branches: one for regression and one for classification. Each branch extracts features using two consecutive 3 × 3 convolutions followed by a single 1 × 1 convolution. Finally, the model computes the bounding box loss (Bbox.loss) and classification loss (Cls.loss) separately. Although the two consecutive 3 × 3 convolutions can extract richer features, they may also introduce feature redundancy and cause some loss of fine-grained details.
To this end, this paper designs a lightweight detection head, Detect-Efficient, which optimizes the utilization of computing resources and improves the detection accuracy and speed of the model. Detect-Efficient first performs feature extraction through a 1 × 1 convolution and a 3 × 3 convolution and then divides into two branches. Each branch performs a 1 × 1 convolution and calculates Bbox.loss and Cls.loss, respectively, and the structure is shown in
Figure 8. Detect-Efficient has lower computational complexity than YOLOv8’s detection head.
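A rough weight count illustrates why sharing the feature-extraction convolutions lowers the cost of the head. The sketch below is a back-of-the-envelope comparison under stated assumptions: a single hypothetical channel width (64) is used for every layer, bias and BN parameters are ignored, and the function names are invented for illustration. Only the layer layout (two 3 × 3 plus one 1 × 1 convolution per branch versus a shared 1 × 1 and 3 × 3 stem with one 1 × 1 convolution per branch) and the seven classes are taken from the text.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution (bias and BN parameters ignored)."""
    return c_in * c_out * k * k


def decoupled_head_weights(c_in, c_mid, n_box, n_cls):
    """Two branches, each with two 3x3 convolutions followed by one 1x1 convolution."""
    def branch(c_out):
        return (conv_weights(c_in, c_mid, 3)
                + conv_weights(c_mid, c_mid, 3)
                + conv_weights(c_mid, c_out, 1))
    return branch(n_box) + branch(n_cls)


def shared_stem_head_weights(c_in, c_mid, n_box, n_cls):
    """Shared 1x1 and 3x3 convolutions, then one 1x1 convolution per branch."""
    stem = conv_weights(c_in, c_mid, 1) + conv_weights(c_mid, c_mid, 3)
    return stem + conv_weights(c_mid, n_box, 1) + conv_weights(c_mid, n_cls, 1)


# Hypothetical widths for one detection scale; 7 classes as in the dataset.
C_IN, C_MID, REG_MAX, N_CLS = 64, 64, 16, 7
baseline = decoupled_head_weights(C_IN, C_MID, 4 * REG_MAX, N_CLS)
efficient = shared_stem_head_weights(C_IN, C_MID, 4 * REG_MAX, N_CLS)
print(baseline, efficient, round(efficient / baseline, 3))
```

The absolute numbers depend entirely on the assumed widths. The point of the comparison is only that removing three of the four 3 × 3 convolutions dominates the saving.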
Additionally, the Distribution Focal Loss (DFL) contributes substantially to bounding box regression by discretizing coordinate predictions to improve localization precision. This study raises the reg_max value from 16 to 20 to improve detection performance. This modification achieves more accurate coordinate prediction by increasing the number of discrete bins used to represent each coordinate, which effectively increases the output dimension per coordinate in the regression head. Although this adjustment introduces additional parameters and slightly increases computational cost, it leads to better localization accuracy. With proper optimization strategies, the impact on inference speed remains minimal, making the trade-off worthwhile.
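The effect of the bin count can be made concrete with a minimal sketch of DFL-style decoding. It assumes the usual convention that each of the four box sides is predicted as a distribution over reg_max bins and decoded as the expected bin index. The function names `dfl_decode` and `regression_channels` are hypothetical.

```python
import math


def dfl_decode(logits):
    """Expected bin index under a softmax over the bins of one box side."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


def regression_channels(reg_max):
    """Output channels of the box branch: four box sides, reg_max bins each."""
    return 4 * reg_max


# With 20 bins, logits peaked equally on bins 5 and 6 decode to about 5.5.
logits = [0.0] * 20
logits[5] = logits[6] = 10.0
distance = dfl_decode(logits)

print(regression_channels(16), regression_channels(20), round(distance, 2))
```

Under this convention, raising reg_max from 16 to 20 widens the box branch output from 64 to 80 channels, which is the source of the small parameter increase mentioned above.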
In most practical settings, many high-resolution images must be inspected in real time to identify solar panel defects. By sharing the feature-extraction convolutions between the two branches, Detect-Efficient significantly decreases computational complexity while increasing detection speed.
3.4. WIOU
The YOLOv8 network uses Complete Intersection over Union (CIoU) Loss as its bounding box loss function. This loss function optimizes the center distance, overlap area, and aspect ratio difference between predicted and ground-truth boxes, thereby improving how accurately the predicted box fits the target. However, CIoU Loss does not account for imbalance in the quality of the training samples. For instance, the number of samples for some common defects is much larger than that for rare defects, which may reduce the training efficiency of the network. In addition, because the matching between predicted and ground-truth boxes is largely random in the early stage of training, unstable gradient updates may further degrade optimization and even reduce the final detection performance. To solve this problem, Wise-IoU (WIoU) [47] enhances the model’s robustness to low-quality data by using a dynamic non-monotonic focusing mechanism and a gradient gain technique. The effect of low-quality samples is reduced, and the model’s focus on normal-quality anchor boxes and the overall detector performance are enhanced.
For the object detection task, IoU is used to measure the overlap between the ground truth bounding boxes and the predicted bounding boxes. The data used to calculate IoU is shown in
Figure 9. To clarify the equations used, the variables involved are defined as follows: W_t and H_t are the width and height of the intersection area between the target and prediction frames; w and h are the width and height of the prediction frame; w_gt and h_gt are the width and height of the target frame; (x, y) are the prediction frame’s center coordinates; (x_gt, y_gt) are the target frame’s center coordinates; W_g and H_g are the width and height of the smallest bounding frame that contains both the prediction and the target frames. All the above variables are illustrated in Figure 9.
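The WIoUv1 loss function [47] consists of an IoU-based term and a distance-based term. Its complete formulation is provided in Equation (1), where the distance-based term R_WIoU is defined in Equation (2), and the IoU-based term L_IoU in Equation (3).
$$L_{WIoUv1} = R_{WIoU}\, L_{IoU} \tag{1}$$
$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_g^{2} + H_g^{2}\right)^{*}}\right) \tag{2}$$
$$L_{IoU} = 1 - IoU = 1 - \frac{W_t H_t}{w h + w_{gt} h_{gt} - W_t H_t} \tag{3}$$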
The superscript * indicates that W_g and H_g are excluded from the gradient computation to ensure numerical stability and remove potential barriers to loss function convergence.
Conventional bounding box loss functions may suffer from vanishing gradients when the boxes do not overlap, that is, when the intersection width W_t or height H_t is zero. Because R_WIoU ∈ [1, e), the distance term significantly amplifies the L_IoU of ordinary-quality anchor boxes. Because L_IoU ∈ [0, 1], it in turn reduces R_WIoU for high-quality anchor boxes, so less attention is paid to the center-point distance when an anchor box already overlaps the target box well. Because of the dynamic nature of WIoU, the quality assessment criteria for anchor boxes are adaptively modified, allowing the gradient gain allocation strategy to adapt to the current state of training.
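The behavior of the two terms can be checked numerically with a small sketch. It assumes the standard WIoU v1 form, in which the loss is the product of a distance-based factor exp(d² / (W_g² + H_g²)) and the IoU loss 1 − IoU. Boxes are given as (center x, center y, width, height), and the function name `wiou_v1` is hypothetical. Plain floats are used, so the detachment of W_g and H_g from the gradient is only noted in a comment.

```python
import math


def wiou_v1(pred, target):
    """WIoU v1 for boxes given as (cx, cy, w, h); returns (loss, r_wiou, l_iou)."""
    x, y, w, h = pred
    x_gt, y_gt, w_gt, h_gt = target

    # Intersection width and height (W_t, H_t), clamped at zero for disjoint boxes.
    w_t = max(0.0, min(x + w / 2, x_gt + w_gt / 2) - max(x - w / 2, x_gt - w_gt / 2))
    h_t = max(0.0, min(y + h / 2, y_gt + h_gt / 2) - max(y - h / 2, y_gt - h_gt / 2))
    inter = w_t * h_t
    union = w * h + w_gt * h_gt - inter
    l_iou = 1.0 - inter / union

    # Smallest enclosing box (W_g, H_g). In an autograd framework these would be
    # detached from the graph, which is what the superscript * denotes.
    w_g = max(x + w / 2, x_gt + w_gt / 2) - min(x - w / 2, x_gt - w_gt / 2)
    h_g = max(y + h / 2, y_gt + h_gt / 2) - min(y - h / 2, y_gt - h_gt / 2)

    r_wiou = math.exp(((x - x_gt) ** 2 + (y - y_gt) ** 2) / (w_g ** 2 + h_g ** 2))
    return r_wiou * l_iou, r_wiou, l_iou


perfect = wiou_v1((50, 50, 20, 20), (50, 50, 20, 20))
shifted = wiou_v1((60, 50, 20, 20), (50, 50, 20, 20))
disjoint = wiou_v1((100, 50, 20, 20), (50, 50, 20, 20))
```

For the disjoint pair the IoU loss saturates at 1, but the distance factor still grows with the center offset, so the product continues to carry a signal that pulls the prediction toward the target.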
In solar panel defect detection tasks, image diversity and complexity present challenges. Blurred or small-sized defects are low-quality examples that may introduce noise into the training process and provide unreliable gradient signals. WIoU loss reduces the relative weight of low-quality samples by assigning adaptive weights to samples of different quality (e.g., critical targets and medium-quality samples), thereby mitigating their negative impact on parameter updates. This allows the model to focus more on the detection accuracy of key objects and enhances its handling of medium-quality samples, ultimately improving overall detection performance.
4. Experimentation
4.1. Dataset
The solar panel defect dataset contains 4447 annotated images covering seven key defect types: crack, finger, black_core, thick_line, star_crack, horizontal_dislocation, and short_circuit. Each image is annotated with the location and category of the defect, and a single image may contain multiple defect types. The categories are distributed as follows: Crack (1009 images, 1026 labels), Finger (1502 images, 2949 labels), Black_core (962 images, 1247 labels), Thick_line (775 images, 981 labels), Star_crack (122 images, 134 labels), Horizontal_dislocation (266 images, 798 labels), and Short_circuit (492 images, 492 labels). The distribution is highly uneven: Star_crack has only 134 instances, while Finger has nearly 3000. Given the limited size of the dataset and this severe imbalance, a sufficiently large and diverse training set is crucial for effective model learning, especially for rare defect types, so we used an 8:1:1 split for training, validation, and testing. Although only 10% of the data (445 images) is allocated to each of the validation and test sets, the subsets were created through random sampling so that each retains a representative distribution of all defect types. This enables reliable evaluation of the model’s generalization performance while maximizing the number of training samples to enhance stability and robustness.
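The 8:1:1 random split can be sketched as follows. This is a minimal illustration under assumptions: the function name `split_indices`, the seed, and the rounding rule are hypothetical and are not taken from the paper. Only the image count (4447) and the split ratio come from the text.

```python
import random


def split_indices(n_images, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle image indices and cut them into train / val / test subsets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_val = round(n_images * ratios[1])
    n_test = round(n_images * ratios[2])
    n_train = n_images - n_val - n_test  # remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(4447)
print(len(train), len(val), len(test))
```

Plain random sampling preserves the class distribution only in expectation. With rare categories such as Star_crack, it is worth checking the per-class counts in each subset after splitting.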
To enhance the model’s robustness to common disturbances in actual photovoltaic images (such as geometric changes and uneven lighting), various data augmentation strategies were introduced during training. These included random affine transformations (rotation, scaling, and translation) to simulate image geometric deformations, as well as brightness and contrast perturbations to simulate non-uniform lighting. These augmentation techniques significantly improved the model’s generalization ability in complex scenes. Additionally, the multi-branch structure and context modeling mechanism designed inherently possess spatial consistency preservation capabilities, which further mitigated the impact of the aforementioned disturbances on detection accuracy.
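The two families of augmentation mentioned above can be sketched in a few lines. All parameter ranges below (rotation up to 10 degrees, scale in [0.9, 1.1], shifts up to 10% of the image size) and the helper names are hypothetical, since the paper does not state the values it used. Coordinates are assumed to be normalized to [0, 1] and intensities to be 8-bit.

```python
import math
import random


def random_affine(rng, max_deg=10.0, scale_range=(0.9, 1.1), max_shift=0.1):
    """2 x 3 affine matrix combining a random rotation, isotropic scale and translation."""
    theta = math.radians(rng.uniform(-max_deg, max_deg))
    s = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    return [[s * math.cos(theta), -s * math.sin(theta), tx],
            [s * math.sin(theta), s * math.cos(theta), ty]]


def apply_affine(matrix, point):
    """Map a point (x, y) through a 2 x 3 affine matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


def adjust_pixel(value, contrast, brightness):
    """Contrast and brightness perturbation of an 8-bit intensity, clipped to [0, 255]."""
    return min(255, max(0, round(contrast * value + brightness)))


rng = random.Random(42)
matrix = random_affine(rng)
corner = apply_affine(matrix, (0.25, 0.25))
pixel = adjust_pixel(100, contrast=1.2, brightness=10)
```

In a detection pipeline the same affine matrix has to be applied to the bounding box corners as to the image, otherwise the annotations no longer match the transformed defects.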
4.2. Model Selection
YOLOv8 is an advanced target detection algorithm available in five model sizes that meet different needs. YOLOv8n is the smallest and fastest model but provides relatively low accuracy. The YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x models are more accurate than YOLOv8n but are larger and slower, require more computational resources, and are more challenging to deploy. Considering all factors, YOLOv8n is chosen as the baseline model in this paper.
In terms of resolution selection, this study systematically evaluated the performance of three scales ranging from 640 × 640 to 1280 × 1280 through experimental testing. The results are shown in
Table 1, and 640 × 640 was ultimately determined as the optimal input resolution. The experimental results reveal the following: First, in terms of detection accuracy, the model at 640 × 640 resolution reaches 90% mAP@50 and 63.7% mAP@50-95. Compared with 800 × 800 resolution, mAP@50 is only reduced by 0.5%, but mAP@50-95 is increased by 2.6%. Compared with 1280 × 1280 resolution, mAP@50 and mAP@50-95 increase by 1.4% and 2.0%, respectively, showing better comprehensive detection performance. Second, in terms of computational efficiency, the 640 × 640 resolution achieved an inference speed of 107.1 FPS, representing improvements of 26.2% and 88.5% compared to 800 × 800 and 1280 × 1280, respectively. Additionally, it significantly reduced GPU memory usage, enhancing the feasibility of industrial deployment. The effectiveness of this choice is attributed to the innovative multi-scale feature extraction design of the model. The C2f-DWR-DRB module enhances detection capabilities for small objects through an expandable receptive field. The COT attention mechanism effectively compensates for information loss caused by resolution limitations through global–local feature fusion. This design enables the model to maintain detection accuracy while significantly enhancing practical application value.
4.3. Experimental Environment
The solar panel defect detection experiments were run on the following hardware: a 12th-generation Intel Core i7-12700H processor operating at 2.30 GHz, 32 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. The software environment was a 64-bit Windows 11 operating system with Python 3.8 as the programming language. The implementation used PyTorch 2.0.1 as the deep learning framework, together with key software libraries including NumPy 1.24.4, OpenCV 4.9.0, and CUDA 10.0.
4.4. Indicators for Model Evaluation
In the YOLO series of models, the primary evaluation metrics include P, R, AP, and mAP. Among these, P (precision) refers to the proportion of samples predicted as positive that are truly positive. R (recall) refers to the proportion of truly positive samples that are correctly detected. AP (average precision) is defined as the average of precision across different recall levels, which can be obtained by calculating the area under the P-R curve. mAP (mean average precision) is defined as the average of the AP values across all categories.
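The metric definitions above can be made concrete with a short sketch. The integration scheme is an assumption: the code uses the all-point interpolation popularized by the Pascal VOC benchmark, whereas the evaluation toolkit used for the reported results may sample the P-R curve differently. The function names and the example values are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the P-R curve with all-point interpolation.

    `recalls` must be sorted in increasing order and `precisions` must
    hold the precision measured at each of those recall levels.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangles between consecutive recall levels.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=2))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
    print(mean_average_precision({"crack": 0.75, "finger": 0.25}))
```

Under the usual convention, mAP@50 counts a detection as correct when its IoU with the ground truth is at least 0.5, while mAP@50-95 averages the mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.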
4.5. Experiments on the C2f-DWR-DRB Module
In the DWR-DRB architecture, DRB modules with different kernel sizes replace the multi-rate dilated depth-wise convolution in the DWR module structure. This study evaluated two modified configurations. The first configuration, C2f-DWR, replaces the bottleneck module in the standard C2f module with the DWR module. The second configuration, C2f-DWR-DRB, replaces the bottleneck module with the DWR-DRB module. The experimental results are shown in
Table 2. The experimental results demonstrate that the C2f-DWR-DRB model is more compact than the C2f-DWR model. The model size decreases from 6.2 MB to 6.1 MB, computational complexity falls from 8.1 GFLOPs to 8.0 GFLOPs, and the parameter count decreases from 2,966,245 to 2,901,925 (64,320 fewer parameters, a reduction of about 2.2%). These reductions improve the model’s operational efficiency while maintaining detection accuracy, making it more suitable for deployment on resource-constrained edge devices.
4.6. Experiments on the Detect-Efficient Module
Three configurations were tested to evaluate the detection head impact. These include the original YOLOv8 detection head (reg_max = 16), an improved version with increased regression granularity (reg_max = 20), and the proposed lightweight detection head (reg_max = 20). The experimental results are shown in
Table 3. The experimental results show that increasing reg_max from 16 to 20 in the YOLOv8 detection head improves recall and mAP@50-95, indicating higher localization accuracy. However, this also leads to a slight decrease in precision and mAP@50, while GFLOPs, parameter count, and inference time increase. In contrast, replacing the detection head with the proposed Detect-Efficient design achieves consistent improvements across all key metrics compared to the baseline (reg_max = 16): precision increases by 0.2%, recall improves by 2.6%, mAP@50 improves by 0.4%, and mAP@50-95 improves by 2.4%. Computational cost is also reduced, with GFLOPs decreasing by 1.3 and inference time shortened by 0.4 ms. These results demonstrate that Detect-Efficient benefits not only from more refined bounding box regression (reg_max = 20) but also from improved computational efficiency through its optimized structure. This makes it more suitable for real-time or edge device deployment without sacrificing accuracy.
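To illustrate what reg_max controls, the following minimal sketch decodes one box side in the way the standard YOLOv8 head does: each of the four side distances is predicted as a discrete distribution over reg_max bins and decoded as its expected bin index. Whether Detect-Efficient keeps exactly this decoding is an assumption here, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_distance(logits):
    """Expected bin index under the softmax distribution.

    len(logits) equals reg_max, so the result lies in [0, reg_max - 1]
    (in units of the feature-map stride).
    """
    probs = softmax(logits)
    return sum(index * p for index, p in enumerate(probs))


def regression_channels(reg_max):
    """Each of the four box sides needs reg_max output channels."""
    return 4 * reg_max


if __name__ == "__main__":
    for reg_max in (16, 20):
        uniform = [0.0] * reg_max
        print(reg_max, regression_channels(reg_max), decode_distance(uniform))
```

Under this scheme, raising reg_max from 16 to 20 increases the regression output from 64 to 80 channels per scale, which is consistent with the higher GFLOPs and parameter count observed when only reg_max is changed in the original head.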
4.7. Error Analysis
The confusion matrix for DCE-YOLO is shown in the figure, which visualizes the model’s per-class prediction behavior. In addition to the seven defect types, the figure includes a background category. Each row represents the actual category, while each column represents the predicted category. The rightmost column shows the total misdetection probability for each type. The last row represents the total missed detection probability. Taking crack as an example, the correct prediction rate is 87%, with a 10% missed detection rate. The probabilities of being misclassified as finger and star crack are 3% and 1%, respectively. The main diagonal of the confusion matrix represents the probability that the predicted category matches the actual category. Values closer to 1.0 indicate better model performance.
Figure 10 demonstrates that five categories achieve correct prediction probabilities of 85% or higher, confirming the accuracy of the proposed model.
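As a reading aid for the confusion matrix, the sketch below normalizes raw counts so that each row (actual category) sums to 1, following the row and column convention stated above. The counts and the reduced three-class label set are hypothetical and are not taken from Figure 10.

```python
def normalize_rows(counts):
    """Divide each row by its sum so that every row becomes a distribution."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def per_class_accuracy(counts, labels):
    """Diagonal of the row-normalized matrix: correct prediction rate per class."""
    normalized = normalize_rows(counts)
    return {label: normalized[i][i] for i, label in enumerate(labels)}


# Hypothetical counts: rows are actual categories, columns are predictions.
labels = ["crack", "finger", "background"]
counts = [
    [87, 3, 10],  # actual crack: 87 correct, 3 called finger, 10 missed
    [2, 90, 8],   # actual finger
    [5, 5, 0],    # actual background: false detections only
]

if __name__ == "__main__":
    for label, rate in per_class_accuracy(counts, labels).items():
        print(f"{label}: {rate:.2f}")
```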
4.8. Graphical Analysis
The performance comparison results of DCE-YOLO and YOLOv8 under the same parameter settings (input resolution 640 × 640 px) and training conditions are shown in
Figure 11. The experiments show that DCE-YOLO has slightly lower mAP@50 and mAP@50-95 than YOLOv8 during the early training stages but clear advantages in the later stages. The mAP@50-95 improvement is particularly pronounced, which verifies the effectiveness of the proposed method. In addition, DCE-YOLO achieves better detection accuracy and better training stability: its training curve fluctuates less and is smoother in the interval of 50–150 epochs. As training progresses, the performance advantage of DCE-YOLO continues to grow, and the model maintains a stable lead in the later stage of training. These results indicate that DCE-YOLO effectively improves the overall detection performance of the model.
4.9. Comparative Experiments
To validate the superiority of the proposed algorithm, we conducted comparative experiments using the same solar panel defect detection dataset on multiple mainstream object detection models, including YOLOv5n, YOLOv9t, YOLOv10n, YOLO11n, YOLO12n, Faster R-CNN, SSD, and RT-DETR. The experimental results are shown in
Table 4. The experimental results show that the proposed DCE-YOLO achieves the highest recall, mAP, and F1 values, outperforming all comparison methods. Compared with classical anchor-based detectors such as the two-stage Faster R-CNN and the single-stage SSD, DCE-YOLO improves mAP by 28.5% and 2.0%, respectively, highlighting its superior localization and classification capabilities on complex defect patterns. Even compared with recent transformer-based models (e.g., RT-DETR), DCE-YOLO still achieves a 1.0% higher mAP, demonstrating its efficiency and accuracy on relatively small datasets with structural defect features. To enable a fair comparison with lightweight YOLO variants, the depth and width parameters of YOLOv5n to YOLO12n were standardized. At the same model scale, DCE-YOLO achieves mAP scores that are 1.6%, 1.6%, 2.3%, 2.3%, and 2.9% higher than YOLOv5n, YOLOv9t, YOLOv10n, YOLO11n, and YOLO12n, respectively. These improvements are attributed to the enhanced feature extraction, symmetric perception design, and optimized detection head of the proposed architecture, which collectively contribute to higher detection robustness and better generalization in real-world photovoltaic detection tasks.
4.10. Ablation Experiments
The ablation experiments verify the contribution of each improved module by comparing network performance as the modules are added. Each improved model X denotes the network with the corresponding modules added; added modules are denoted by “√”, and modules that are not added are denoted by “×”. The experimental results are shown in Table 5. Introducing the C2f-DWR-DRB module alone brings significant improvements: precision increases by 6.2%, mAP@50 increases by 1.2%, and mAP@50-95 increases by 0.7%, while computational complexity is reduced to 8.0 GFLOPs. After the COT attention mechanism is added, mAP@50 increases by 1.4% and mAP@50-95 by 1.4%, while precision remains high (88.5%). The Detect-Efficient detection head brings the key breakthrough: recall rises to 87.9%, mAP@50 increases by 1.8%, mAP@50-95 increases by 3.8%, and GFLOPs are reduced from 8.1 to 7.2. Finally, the WIoU loss function is introduced to achieve the best overall balance: precision increases by 5.8%, recall by 3.7%, mAP@50 by 2.1%, and mAP@50-95 by 4.9%. The synergistic effects of all modules improve the final model’s performance on key metrics including precision, recall, and mAP@50. The model also retains its computational efficiency advantage, validating the effectiveness of the improved approach in solar panel defect detection tasks.
4.11. Visualization Results
Figure 12 illustrates the detection results of the optimized model, where a larger number of valid detection boxes and higher confidence scores indicate better model performance. In terms of the number of detections, the three images in Figure 12b contain more valid detection boxes than those in Figure 12a. In terms of confidence, the detection boxes in Figure 12b mostly have higher confidence scores than those in Figure 12a. The experimental results show that the improved algorithm can accurately extract the target features and obtain better detection results.
4.12. Computational Cost Analysis
To further evaluate the practical feasibility of DCE-YOLO, we analyzed its computational complexity and runtime performance. As shown in
Table 5, the proposed DCE-YOLO, which incorporates the C2f-DWR-DRB and COT modules, achieves a low computational cost of 7.2 GFLOPs, 0.9 GFLOPs lower than the original YOLOv8n. The average inference time per image at a resolution of 640 × 640 is 2.9 ms, enabling real-time processing at approximately 107 FPS. The complete model was trained on an NVIDIA RTX 3080 Ti GPU for 200 epochs with a batch size of 16, taking approximately 2.5 h in total. Testing on the validation set of 445 images took less than 2 s, highlighting the model’s efficiency and suitability for edge deployment. The experimental results demonstrate that DCE-YOLO achieves a good balance between computational cost and detection performance.
5. Discussion
Solar panel defect detection faces two major challenges: insufficient multi-scale feature extraction and severe class imbalance. To address these issues, this paper proposes the DCE-YOLO model, which integrates several architectural improvements, including the C2f-DWR-DRB module, COT attention mechanism, lightweight detection head (Detect-Efficient), and WIoU loss function. This design balances accuracy and efficiency, providing a comprehensive solution for photovoltaic defect detection.
The DCE-YOLO architecture explicitly considers the geometric priors of solar panel defects, such as cracks, short circuits, and grid breaks, which typically exhibit bilateral or radial symmetry. The DRB module uses convolutions with different dilation rates to create symmetric receptive fields, while the DWR structure fuses semantic and regional information. This design significantly enhances the model’s ability to recognize bilaterally and radially symmetric defect structures (e.g., star_crack and horizontal_misalignment). The C2f-DWR-DRB module adopts a multi-branch design, enabling parallel modeling at different receptive field scales. This effectively improves the model’s adaptability to both small-scale defects (e.g., finger defects) and large-scale defects (e.g., cracks and horizontal misalignments). Experiments show that introducing this module improves mAP@50 by 1.2% and mAP@50-95 by 0.7%. It also reduces the computational cost to 8.0 GFLOPs, demonstrating a good balance between detection accuracy and efficiency.
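The link between dilation rate and receptive field can be checked with a short calculation. A convolution with kernel size k and dilation rate d covers an effective span of k + (k − 1)(d − 1) pixels without adding weights. The branch settings used below are illustrative assumptions, not the exact configuration of the DWR-DRB module.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


if __name__ == "__main__":
    # Three hypothetical parallel branches with the same 3 x 3 kernel.
    for dilation in (1, 3, 5):
        print(dilation, effective_kernel_size(3, dilation))
    print(stacked_receptive_field([(3, 1), (3, 3)]))
```

Because every branch keeps the same number of weights, parallel branches with different dilation rates cover small and large defects at a similar parameter cost, which is the intuition behind the multi-branch design.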
The COT attention mechanism addresses the limitations of traditional YOLO models in context modeling. The COT module captures spatial dependencies between distant regions in an image by introducing the interaction and fusion of static local context and dynamic global context. This effectively enhances the detection capability of fine-grained defects on complex backgrounds. After incorporating the COT mechanism, mAP@50 and mAP@50-95 increased by 0.4% and 0.7%, respectively.
Detect-Efficient is a lightweight detection head designed to improve the accuracy of bounding box regression. Compared to the original detection head in YOLOv8, Detect-Efficient achieves a better trade-off between accuracy and computational efficiency, making it particularly suitable for industrial deployment and real-time applications.
Additionally, the WIoU loss function employs a dynamic reweighting mechanism to reduce the influence of low-quality samples on gradient updates. This enables the model to focus more effectively on high-quality samples, which is particularly beneficial for underrepresented classes such as star_crack and short_circuit, significantly improving recall.
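The reweighting idea can be sketched as follows. The text does not state which WIoU variant is used, so this sketch assumes the commonly cited v3 formulation: a distance-based factor multiplies the IoU loss, and a non-monotonic focusing coefficient r = β / (δ · α^(β − δ)) scales it, where β is the ratio of a sample’s IoU loss to the running mean IoU loss. The hyperparameters α = 1.9 and δ = 3 and all function names are assumptions. Gradient handling (the enclosing-box term and β are normally treated as constants during backpropagation) is omitted.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def wiou_v1(pred, target):
    """Return (distance-weighted loss, plain IoU loss) for one box pair."""
    l_iou = 1.0 - iou(pred, target)
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    # Width and height of the smallest box enclosing both boxes.
    wg = max(pred[2], target[2]) - min(pred[0], target[0])
    hg = max(pred[3], target[3]) - min(pred[1], target[1])
    r_wiou = math.exp((dx * dx + dy * dy) / (wg * wg + hg * hg))
    return r_wiou * l_iou, l_iou


def focusing_coefficient(l_iou, mean_l_iou, alpha=1.9, delta=3.0):
    """Non-monotonic gain r; beta measures how much of an outlier a sample is."""
    beta = l_iou / mean_l_iou
    return beta / (delta * alpha ** (beta - delta))


if __name__ == "__main__":
    loss_v1, l_iou = wiou_v1((0, 0, 2, 2), (1, 1, 3, 3))
    gain = focusing_coefficient(l_iou, mean_l_iou=0.5)
    print(loss_v1, gain, gain * loss_v1)
```

With these assumed settings, the gain equals 1 when β = δ and shrinks for strong outliers (large β), so poorly localized samples contribute smaller gradients.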
Experimental results demonstrate that DCE-YOLO outperforms mainstream models such as YOLOv8n, YOLOv9t, and YOLOv10n on key metrics, including mAP@50, mAP@50-95, recall, and F1-score. The lightweight and efficient detection head also significantly reduces computational costs, enabling the model to be deployed smoothly on edge devices such as NVIDIA Jetson Xavier. Compared to larger models like YOLOv8l, DCE-YOLO better meets the low-power and low-latency requirements of real-world photovoltaic system monitoring tasks.
The model demonstrates strong generalization capabilities in image testing and real-world inspection scenarios. In the future, we plan to explore knowledge distillation and semi-supervised learning strategies to improve performance in low-annotation environments. We also aim to extend DCE-YOLO to support multimodal inputs, such as infrared and multispectral images, to achieve broader applications in photovoltaic defect detection.
Although DCE-YOLO demonstrates strong performance and architectural innovation, the method still has certain limitations. First, while the proposed C2f-DWR-DRB module and COT attention enhance the extraction of multi-scale and contextual features, the model primarily relies on RGB visual information. This limits its ability to detect defects with weak textures or low visual contrast, such as hidden cracks or temperature-induced anomalies, which can be better captured by infrared or electroluminescence imaging. Second, the class imbalance in the data is not fully resolved. Although the WIoU loss mitigates this issue to some extent, rare classes like star_crack and short_circuit remain underrepresented, potentially affecting generalization performance in real-world deployments. Third, although the model is lightweight compared to standard YOLO variants, the integration of multiple modules (e.g., COT, DRB) increases its architectural complexity. This added complexity may pose challenges for deployment on ultra-constrained devices or embedded systems with strict latency and memory budgets. Finally, the current design does not incorporate domain adaptation or transfer learning strategies, which may be essential for adapting to different panel types, imaging conditions, and on-site environments. These limitations suggest that future work should explore multimodal fusion, efficient architecture search, and learning from limited or unlabeled data to further improve robustness and scalability.