3.2. Datasets
The datasets used in this study include the PVEL-AD public dataset [16], which was jointly released by Hebei University of Technology and Beihang University. The PVEL-AD dataset is publicly available and has been widely used in photovoltaic cell defect detection research. It contains 12 types of PV cell defects at image resolutions of up to 1024 × 1024. Based on industrial inspection standards and defect distribution characteristics, seven representative defect categories were selected for this study: crack, finger, black core, thick line, star crack, horizontal dislocation, and short circuit. A total of 4500 publicly available images were used for training and evaluation in the experiments.
To further validate the generalization ability of the model, we conducted additional experiments using the Steel Defect Dataset as a supplementary evaluation. This dataset contains defect images from different materials, and although it is primarily used for detecting steel defects, it helps assess the model’s adaptability to different defect types and background complexities. Through this, we aim to verify that the YOLO-PV model not only performs well in photovoltaic defect detection but also shows good performance in other industrial defect detection scenarios.
3.3. Evaluation Metrics
In object detection tasks, the selection of evaluation metrics is crucial for comprehensively assessing model performance. In this study, precision (P), recall (R), and mean average precision (mAP) are primarily employed as the evaluation criteria.
P measures the accuracy of the predicted results and is calculated as follows:
P = TP / (TP + FP)
Here, TP (True Positives) denotes the number of correctly detected instances, while FP (False Positives) denotes the number of instances incorrectly detected as targets. A higher precision indicates that the model can effectively reduce the false positive rate.
R reflects the model’s ability to detect true target instances and is calculated as follows:
R = TP / (TP + FN)
Here, FN (False Negatives) denotes the true target instances that were not detected. A higher recall indicates that the model achieves more comprehensive coverage of the targets.
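As a minimal illustration of the two definitions above (the counts are hypothetical, not taken from the experiments), precision and recall follow directly from the TP/FP/FN tallies:

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth targets."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical tallies: 90 correct detections, 10 false alarms, 15 missed defects.
p = precision(tp=90, fp=10)  # 0.90
r = recall(tp=90, fn=15)     # ~0.857
```

Raising the detection confidence threshold typically trades recall for precision; the mAP metrics discussed next summarize that trade-off over all thresholds.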
In multi-class detection tasks, the mean Average Precision (mAP) is further employed as an overall performance metric, which is defined as:
mAP = (1/N) ∑_{i=1}^{N} AP_i
where N denotes the total number of categories, and AP_i represents the average precision of the i-th category.
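Concretely, the definition reduces to an unweighted mean over the per-class AP values; a minimal sketch (the AP numbers below are placeholders, not results from this paper):

```python
def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum(AP_i): unweighted mean of per-class average precisions."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class) / len(ap_per_class)

# Placeholder AP values for seven defect categories:
map_value = mean_average_precision([0.86, 0.95, 0.99, 0.92, 0.83, 0.97, 0.98])
```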
In the specific experiments, the commonly adopted evaluation metrics include mAP@0.5 and mAP@0.5:0.95. The former refers to the mean Average Precision computed at an IoU threshold of 0.5, emphasizing the overlap accuracy between the predicted and ground-truth bounding boxes. The latter, in contrast, averages AP values across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, thereby providing a stricter and more comprehensive assessment of localization accuracy and robustness. Consequently, mAP@0.5 is often employed for a quick comparison of the baseline detection capability of models, whereas mAP@0.5:0.95 is considered a more challenging and authoritative evaluation criterion [17].
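The difference between the two metrics can be made concrete with a small sketch: an IoU function for axis-aligned boxes, and the ten-threshold average that defines mAP@0.5:0.95 (how AP is computed at each threshold is abstracted behind a callable here):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# mAP@0.5:0.95 averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.50 + 0.05 * k for k in range(10)]

def map_50_95(ap_at):
    """ap_at: callable mapping an IoU threshold to the mAP measured there."""
    return sum(ap_at(t) for t in THRESHOLDS) / len(THRESHOLDS)
```

Because AP can only fall as the IoU threshold tightens, mAP@0.5:0.95 is always at most mAP@0.5, which is why it is the stricter of the two scores.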
3.4. Performance Evaluation and Result Analysis
To comprehensively evaluate the detection performance of the proposed YOLO-PV model, a series of analyses were conducted from four aspects: training convergence, quantitative performance metrics, classification accuracy, and detection reliability.
Figure 6 illustrates the training convergence curves of mAP@0.5 and mAP@0.5:0.95 for YOLO11n and YOLO-PV. As observed, YOLO-PV converges faster during the early training phase and exhibits smoother, more stable curves in the later epochs. It ultimately achieves higher accuracy on both metrics, demonstrating that the enhanced backbone and multi-scale fusion structure improve learning stability and optimization efficiency.
To quantitatively compare model performance, Figure 7 presents a bar chart of key evaluation metrics. YOLO-PV outperforms YOLO11n in Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95, with improvements of 6.7%, 2.2%, 2.9%, and 4.4%, respectively. Additionally, in multi-scale detection performance, YOLO-PV achieves AP_S = 31.8%, AP_M = 56.7%, and AP_L = 62.5%, indicating a stronger capability for identifying small and fine defects compared to the baseline model.
Figure 8 shows the confusion matrices for four representative defect types, including crack, finger, black core, and short circuit. The YOLO-PV model exhibits deeper diagonal regions and fewer off-diagonal errors, reflecting higher classification accuracy and localization precision. Particularly for samples with blurred boundaries or small defect areas, YOLO-PV maintains stable and consistent predictions, confirming its robustness against noise and complex textures.
Furthermore, Figure 9 presents the Precision–Recall (PR) curves of YOLO11n and YOLO-PV. The YOLO-PV curve encloses a larger area and maintains higher precision even at high recall levels, suggesting a better balance between confidence and recall thresholds. Compared to YOLO11n, YOLO-PV demonstrates more stable detection behavior and reduced performance fluctuation, indicating improved reliability in practical detection scenarios.
In summary, experimental results across multiple metrics confirm that YOLO-PV achieves faster convergence, higher detection accuracy, and stronger robustness in photovoltaic (PV) defect detection tasks, while its generalization capability is further evaluated on additional datasets in the following sections.
3.5. Comparative Experiments
To validate the superiority of the proposed YOLO-PV model over other state-of-the-art object detection frameworks, we conducted a series of comparative experiments. Representative detection algorithms were selected as benchmarks, and their performance was evaluated on the PVEL-AD dataset. The compared models include the two-stage detector Faster R-CNN [18], several mainstream one-stage detectors such as the lightweight YOLOv5n [19] and YOLOv6n [20], the next-generation real-time detector RT-DETR-L [21], the enhanced DEIM-D-FINE-L [15], and the recently proposed YOLOv9t, YOLOv10n, YOLO11n [22], and YOLOv12n [23].
As shown in Table 2, different detection algorithms exhibit clear differences in terms of accuracy and computational complexity. Faster R-CNN achieved a precision of 76.9% and recall of 70.3%, but its overall detection accuracy remained relatively low, with a parameter size as large as 136.87 M and extremely high floating-point operations, making it unsuitable for real-time applications. Lightweight models such as YOLOv5n and YOLOv6n required fewer parameters (2.53 M and 4.23 M, respectively) and had lower inference costs, but their detection accuracy was limited, with mAP@0.5 values of only 89.3% and 76.2%. RT-DETR-L demonstrated certain advantages in accuracy (mAP@0.5 of 87.2% and mAP@0.5:0.95 of 62.0%), yet its large model size (32.0 M parameters, 103.5 GFLOPs) hindered deployment in lightweight scenarios. DEIM-D-FINE-L achieved a better performance–efficiency balance, with mAP@0.5 and mAP@0.5:0.95 of 90.0% and 61.9%, respectively, while maintaining moderate complexity (31.0 M parameters, 91.0 GFLOPs). The YOLOv9t and YOLOv10n models further improved efficiency and precision, achieving mAP@0.5 scores of 87.7% and 88.9%, and mAP@0.5:0.95 scores of 58.9% and 61.9%, with extremely low model sizes (1.97 M and 2.27 M parameters) and computational costs (7.6 and 6.5 GFLOPs), supporting real-time detection speeds of 201 FPS and 300 FPS, respectively. YOLOv12n is an attention-centric version of the YOLO family that introduces lightweight transformer-based modules, including Area Attention and R-ELAN, to enhance feature extraction and detection stability. With 2.55 M parameters and 6.3 GFLOPs, it achieves a precision of 83.3%, recall of 86.2%, and mAP@0.5:0.95 of 60.9%, offering a good trade-off between accuracy and efficiency. YOLO11n struck a relatively good balance between accuracy and efficiency, achieving an mAP@0.5 of 90.2% and an mAP@0.5:0.95 of 62.4%, with a model complexity of only 6.3 GFLOPs.
In contrast, the proposed YOLO-PV model exhibited remarkable advantages across multiple evaluation metrics. While maintaining a lightweight architecture (2.52 M parameters and 7.5 GFLOPs), YOLO-PV achieved a precision of 90.9% and recall of 84.5%. Furthermore, its mAP@0.5 and mAP@0.5:0.95 reached 93.1% and 66.8%, respectively, both significantly higher than the compared models. These results demonstrate that YOLO-PV not only ensures efficient inference but also achieves superior detection accuracy, while its compact model size makes it suitable for deployment on hardware-constrained platforms, highlighting its practical application potential.
In addition to the overall mAP values, Table 3 presents the AP values for each defect category. A comparison reveals that YOLO-PV outperforms YOLO11n across all categories, particularly for small- and medium-scale defects.
YOLO-PV shows significant improvements in AP across the defect categories, with notable increases for crack (from 79.1% to 86.5%), finger (from 94.1% to 95.5%), and star crack (from 74.6% to 83.4%). These results indicate that YOLO-PV is better at detecting small and hard-to-detect defects, particularly in complex backgrounds and small-object detection tasks.
To further evaluate the robustness of the YOLO-PV model in photovoltaic panel defect detection, we conducted an error rate experiment comparing YOLO11n and YOLO-PV.
Table 4 presents a comparison of the two models in terms of classification error rate (Ecls), localization error rate (Eloc), combined classification-and-localization error rate (Eboth), and duplicate detection error rate (Edupe). The results indicate that YOLO-PV achieves significant improvements on all error rate metrics, particularly Ecls and Eloc, which decrease by 0.32 and 1.3 percentage points, respectively.
YOLO-PV’s improvement in Ecls and Eloc demonstrates its significant advantage in enhancing detection accuracy and reducing localization errors, especially in handling complex backgrounds and small defects.
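For clarity, a simplified sketch of how such error categories can be assigned to a single prediction, in the spirit of TIDE-style error analysis; the threshold values, category names, and matching rule here are illustrative assumptions, not the exact protocol behind Table 4:

```python
def classify_error(pred_cls, gt_cls, iou, matched_before,
                   t_fg=0.5, t_bg=0.1):
    """Bin one prediction against its best-matching ground-truth box.

    Returns one of: 'correct', 'cls' (right place, wrong class),
    'loc' (right class, poor overlap), 'both' (both wrong),
    'dupe' (GT already claimed by another prediction), or
    'bkg' (no meaningful overlap with any GT).
    t_fg / t_bg are illustrative foreground/background IoU thresholds.
    """
    if iou < t_bg:
        return "bkg"
    right_cls = pred_cls == gt_cls
    good_loc = iou >= t_fg
    if right_cls and good_loc:
        return "dupe" if matched_before else "correct"
    if right_cls:
        return "loc"
    if good_loc:
        return "cls"
    return "both"

# e.g. a correctly classed box with IoU 0.3 counts as a localization error:
assert classify_error("crack", "crack", 0.3, matched_before=False) == "loc"
```

Aggregating these bins over a test set yields per-category error rates of the kind reported in Table 4.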
In the proposed method, scale modeling and multi-scale feature interaction are among the key factors contributing to performance improvement. To further validate the effectiveness of the proposed model in detecting objects of different scales, we conducted a comparative analysis between the baseline model YOLO11n and the improved model in terms of detection accuracy on small, medium, and large objects.
Table 5 summarizes the experimental results across different scales, where AP_S, AP_M, and AP_L represent the average precision for small, medium, and large objects, respectively.
As the results show, the proposed YOLO-PV model consistently outperforms the baseline YOLO11n across different object scales. The improvement is particularly pronounced for small objects, where AP_S increases from 26.5% to 31.8%, a gain of 5.3 percentage points. This demonstrates that the proposed multi-scale modeling and feature interaction mechanism effectively alleviates the limitations of conventional models in detecting small defects, such as missed detections and inaccurate boundaries. For medium and large objects, YOLO-PV also achieves substantial improvements, with AP_M and AP_L increasing by 10.3 and 7.8 percentage points, respectively. These results indicate that the proposed method not only enhances fine-grained defect recognition but also improves robustness and accuracy in detecting large-scale structural defects. Overall, the findings validate the effectiveness and stability of the improved model in multi-scale scenarios, providing more reliable technical support for photovoltaic cell defect detection.
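The AP_S/AP_M/AP_L split is conventionally based on COCO-style area thresholds; a small sketch of the bucketing (whether the experiments use exactly these 32- and 96-pixel cutoffs is an assumption):

```python
def size_bucket(w: float, h: float) -> str:
    """Assign a bounding box to the scale bucket used for AP_S/AP_M/AP_L.

    COCO convention (assumed here):
      small:  area <  32*32 px
      medium: 32*32 <= area < 96*96 px
      large:  area >= 96*96 px
    """
    area = w * h
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"

# e.g. a 20x20 px micro-crack falls in the 'small' bucket:
assert size_bucket(20, 20) == "small"
```

AP is then computed separately over the ground-truth boxes in each bucket, which is why AP_S isolates exactly the fine defects the multi-scale fusion targets.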
3.6. Ablation Study
To systematically evaluate the feasibility and necessity of each proposed component, an ablation study was conducted using YOLO11n as the baseline, with the results summarized in Table 6. Under identical experimental settings, different components were introduced incrementally and compared against the detection performance of the original model, thereby clearly revealing the individual contribution of each module to the overall performance. Furthermore, the combination of multiple modules was also examined to validate the synergistic effects of their integration on model performance.
On the PVEL-AD dataset, we evaluated each proposed improvement individually. The symbol “✔” in the table indicates the modifications introduced on top of the YOLO11n model. In addition to the four evaluation metrics (Precision, Recall, mAP@0.5, and mAP@0.5:0.95) presented in Section 3.3, we also considered the number of parameters (Param) and floating-point operations (FLOPs).
The experimental results show that replacing the C3K2 modules in the backbone of YOLO11n with the EHMSB module led to an increase of 1.5% in both mAP@0.5 and mAP@0.5:0.95. Further, replacing the original feature pyramid network (FPN) in YOLO11n with the newly designed dual-ESMSAB-enhanced FPN resulted in improvements of 2.6% and 3.1% in mAP@0.5 and mAP@0.5:0.95, respectively. Finally, integrating both the EHMSB and the new FPN into YOLO11n yielded further performance gains, with mAP@0.5 and mAP@0.5:0.95 increasing by 2.9% and 4.4%, respectively.
For YOLO-PV, we reported the results as mean ± standard deviation. Specifically, the mAP@0.5 achieved 93.1% ± 0.14, and mAP@0.5:0.95 reached 66.8% ± 0.15. These improvements of 2.9% and 4.4% over YOLO11n reflect the stability and reliability of the model’s performance across multiple runs. Additionally, YOLO-PV demonstrates excellent FPS performance (320 FPS), ensuring its suitability for real-time detection applications.
These results highlight that YOLO-PV significantly improves detection accuracy, especially in small object and complex background detection tasks, while maintaining real-time performance and a low computational burden, as evidenced by the consistent and reliable experimental data.
3.7. Detection Results and Visualization Analysis
To intuitively demonstrate the performance of the proposed YOLO-PV model in photovoltaic (PV) cell defect detection, representative test samples were selected for visual comparison, as shown in Figure 10. The upper row presents the detection results of the improved YOLO-PV model, while the lower row displays those of the baseline YOLO11n. It can be observed that YOLO11n suffers from missed detections and inaccurate localization in certain samples, particularly under challenging conditions such as blurred crack edges or strong background noise. In contrast, the YOLO-PV model achieves more precise localization of defect regions and exhibits higher robustness and stability in identifying subtle defects such as micro-cracks and finger-like cracks. These qualitative observations are consistent with the improvements in quantitative metrics (e.g., mAP and error rates), further validating the effectiveness and practicality of the proposed method.
To further interpret the model’s decision-making process, Grad-CAM (Gradient-weighted Class Activation Mapping) was employed to visualize the feature activation regions, as illustrated in Figure 11. The Grad-CAM heatmaps provide intuitive insight into where the model focuses when identifying defects. The YOLO-PV model demonstrates concentrated attention on the actual defect areas (e.g., crack lines and black-core boundaries), whereas YOLO11n exhibits more scattered responses with attention partially focused on irrelevant background regions. This indicates that the improved architecture enables the model to capture more discriminative and semantically relevant features, enhancing both feature representation and interpretability.
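For reference, the core Grad-CAM computation is framework-agnostic: each channel of a chosen convolutional layer’s activation map is weighted by its spatially averaged gradient, and the weighted sum is passed through a ReLU. A dependency-free sketch (in practice the activations and gradients come from hooks on the detector’s backbone; layer choice and heatmap normalization/upsampling are omitted):

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's activations and gradients.

    activations, gradients: [C][H][W] nested lists for one image.
    Each channel is weighted by its spatially averaged gradient, the
    weighted channels are summed, and negative responses are clipped
    (ReLU), so the map highlights regions that raise the class score.
    """
    C = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    # Channel importance: global average pooling over each gradient map.
    weights = [sum(sum(row) for row in gradients[k]) / (H * W)
               for k in range(C)]
    cam = [[0.0] * W for _ in range(H)]
    for k in range(C):
        for i in range(H):
            for j in range(W):
                cam[i][j] += weights[k] * activations[k][i][j]
    return [[max(0.0, v) for v in row] for row in cam]  # ReLU
```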
Overall, the visual detection and Grad-CAM analyses confirm that YOLO-PV not only achieves higher quantitative accuracy but also provides better visual interpretability. The model effectively focuses on physically meaningful defect regions under complex illumination and noise conditions, offering a reliable foundation for automated PV module inspection.
3.8. Generalization of the Proposed Method Across Different Detection Frameworks
To further verify the universality and scalability of the proposed method, it was embedded into several representative object detection frameworks, including YOLOv5n, RT-DETR-L, and YOLOv10n, forming the corresponding improved versions YOLOv5n-PV, RT-DETR-L-PV, and YOLOv10n-PV. The experimental results are presented in Table 7.
As shown, integrating the proposed method consistently enhanced the detection performance across all models, which demonstrates the effectiveness and generalization capability of the approach. Specifically, YOLOv5n-PV achieved an mAP@0.5 of 92.5% and mAP@0.5:0.95 of 64.7%, showing improvements of +3.2% and +4.0% compared to the original YOLOv5n, while also reducing the model parameters and computational cost (from 2.51 M/7.1 GFLOPs to 2.11 M/5.5 GFLOPs). Similarly, RT-DETR-L-PV achieved mAP@0.5 and mAP@0.5:0.95 values of 89.5% and 63.7%, improving by +2.3% and +1.7%, respectively, with a moderate reduction in computational complexity (from 103.5 GFLOPs to 40.5 GFLOPs). In addition, YOLOv10n-PV reached mAP@0.5 and mAP@0.5:0.95 values of 92.9% and 63.8%, with increases of +4.0% and +1.9%, while maintaining lightweight complexity (2.07 M parameters, 6.5 GFLOPs).
These results indicate that the proposed method can be flexibly integrated into different mainstream object detection architectures, yielding significant improvements in both accuracy and efficiency. This confirms that the proposed method possesses strong portability, scalability, and generalization potential, making it a promising enhancement approach for future lightweight detection research.
3.9. Generalization Ability Validation: Experimental Results on Steel Defect Dataset
To verify the generalization capability and robustness of the proposed YOLO-PV model under different data distributions, additional experiments were conducted on the Steel Defect Dataset. This dataset differs significantly from the photovoltaic (PV) cell defect dataset in material texture, background complexity, and defect distribution, making it suitable for evaluating the model’s cross-domain stability and adaptability. The experiments include ablation studies, multi-scale detection tests, and error rate analysis.
In the ablation study, the effects of the enhanced backbone and the feature pyramid network (FPN) were examined individually and in combination. The results are summarized in Table 8. It can be observed that using either module alone improves performance to some extent, but the combination of both yields the best results, with mAP@0.5 reaching 82.1% and mAP@0.5:0.95 increasing to 49.7%. Meanwhile, the model remains lightweight (2.52 M parameters) and maintains real-time speed (320 FPS). These results confirm that the proposed modules complement each other and jointly enhance the model’s feature extraction and multi-scale representation capabilities.
Next, the multi-scale detection experiments were conducted to investigate the model’s performance at different object sizes. As shown in Table 9, YOLO-PV maintains stable detection performance across small, medium, and large targets. The improvement is most pronounced for small-scale defects (e.g., micro-cracks), where the AP_S reaches 56.9%, representing a 23.7% increase over the version without the multi-scale fusion strategy. This demonstrates the model’s effectiveness in handling size variations through high-resolution feature integration.
Finally, an error rate experiment was conducted on the same public steel defect dataset to evaluate the model’s cross-domain generalization ability; its texture patterns, defect distribution, and background complexity differ markedly from PV cell images, providing a strong test of robustness. As shown in Table 10, YOLO-PV achieves notably lower Ecls (0.01) and Eloc (8.50), indicating improved stability and precision even under domain shifts. These results demonstrate that the proposed improvements not only enhance PV defect detection but also provide good adaptability to other industrial defect scenarios.
In summary, YOLO-PV achieves outstanding performance across all experiments. The ablation study validates the rationality of the proposed modules, the multi-scale experiments highlight the model’s advantage in small-target detection, and the cross-dataset error rate analysis confirms its strong generalization and robustness. These findings demonstrate that YOLO-PV is not only effective for photovoltaic defect detection but also adaptable to broader industrial visual inspection tasks.