1. Introduction
Photovoltaic (PV) systems are critical components of the energy decarbonisation process; however, several faults can degrade their efficiency, including hot spots, dirt accumulation, shading, busbar defects, and bypass diode faults. Detecting these faults is critical to preserving the energy production and useful lifespan of PV modules [1,2]. In the past decade, substantial research has combined infrared thermography with artificial intelligence and deep learning to diagnose faults in PV modules from thermograms [3,4]. However, despite promising results obtained with convolutional neural networks (CNNs) and one-shot detectors, existing datasets remain limited in size and generalisability [5,6,7].
Current research has concentrated mainly on defect detection rather than on estimating the associated energy impact [8,9]. Segmentation models based on U-Net and Res-U-Net delineate module and cell boundaries more accurately in challenging environments; however, they are vulnerable to overfitting [10].
The coupling of drones (UAVs) with infrared thermography has reduced inspection times, though radiometric accuracy and calibration still constitute major challenges [11]. Early automated frameworks combining module localisation, defect segmentation, and cause classification compared networks such as Feature Pyramid Network, U-Net, and DeepLabv3+, highlighting trade-offs between accuracy and computational cost on UAV-acquired data [12]. Hybrid models combining convolutional networks with classical machine learning expanded coverage to six or seven fault classes; however, results remain unbalanced between frequent and rare classes [13]. Systematic reviews raise various concerns about data quality and the correlation between detected defects and energy loss [14]. Despite advances in segmentation and automation, the energy gap persists. Transfer learning with VGG-16 has simplified training on small datasets and improved hot spot detection, but it has also increased sensitivity to domain shift [15]. Edge-guided architectures and Mask R-CNN have improved contour accuracy in noisy thermograms, but at increased computational cost [16]. Case studies on radiometric inspections and explainable monitoring have proposed predictive workflows for operations and maintenance, but the mapping between failure and power loss remains unstandardised [17,18]. Recent machine-learning frameworks for industrial fault diagnosis have demonstrated that data-driven feature extraction and probabilistic modelling can significantly enhance reliability assessment in complex equipment [19]. Similarly, the integration of deep learning with infrared thermography has confirmed the potential for energy-oriented maintenance, while highlighting the lack of common benchmarks [20]. Recent studies have reiterated that the combination of thermography and deep learning improves predictive maintenance of PV systems, but standardised datasets and energy-based metrics are still lacking [21]. Consequently, current research has shifted towards advanced U-Net architectures and ensemble strategies to overcome generalisation limitations and enable scalable, energy-oriented PV diagnostics [22]. The integration of U-Nets with ASPP and CNN ensembles has improved scalability and multi-scale defect detection on UAV datasets (~10³ images), increasing IoU and F1 scores. However, challenges such as class imbalance, site-specific domain shift, and field energy validation remain unresolved [23,24]. The latest research on IRT for PV systems highlights the need for standardised measurement protocols, thermal normalisation, and energy-oriented performance metrics [25]. Although these studies show progress in automatic thermogram analysis and AI-based diagnostics, they also highlight the persistent scarcity of large shared datasets, the sensitivity of models to environmental conditions and, above all, the lack of a quantitative link between detected defects and actual energy yield losses.
The proposed model, based on IRT + DL, uses U-Net for multi-class segmentation directly on radiometrically normalised thermal data, integrating environmental control procedures. Despite significant progress, the integration of DL and IRT into predictive maintenance still presents unresolved issues. Recent studies [26,27] have examined predictive maintenance strategies for PV systems, including sensory monitoring, machine learning and fault prediction, highlighting the economic and operational benefits of early diagnosis, but without quantitative thermal energy models or standardised validation protocols. Similarly, Ref. [28] analysed UAV-based thermographic inspection for large-scale PV systems, improving pre-processing and failure pattern analysis, but showing that segmentation accuracy remains highly dependent on environmental conditions and the limited representativeness of existing datasets.
Most studies rely on site-specific, single-day acquisition campaigns. In more recent works, U-Net and ASPP-based architectures have substantially improved thermographic defect segmentation, with higher IoU and F1 values. However, these works focus exclusively on segmentation accuracy, without integrating energy loss estimation or testing robustness under different irradiation and environmental conditions. Similarly, Ref. [29] highlighted the potential of deep learning for UAV-based PV monitoring but reported the absence of standardised radiometric calibration and common benchmarks.
Other works have highlighted the lack of explainable and energy-efficient frameworks capable of handling environmental variability and class imbalance [30,31].
Therefore, segmentation accuracy alone (IoU, Dice, F1) does not guarantee diagnostic or maintenance relevance. To overcome this limitation, future studies should include thermo-electrical validation, correlating thermal defects with power degradation measured under controlled irradiation, for example through I–V curve tracking or FEM co-simulation [32,33]. This work addresses the problem by introducing a labelling scheme based on power losses, which links each defect class to an estimated energy loss range derived from radiometric and efficiency models, translating thermal anomalies into quantitative indicators of energy degradation. A further limitation concerns the lack of standardised datasets, calibrations and acquisition protocols, which reduces the reproducibility and transferability of models due to domain shift. To mitigate this, the present study applies radiometric normalisation and environmental control according to IEC TS 62446-3 [34], improving reproducibility and comparability, while highlighting the need for inter-laboratory calibration standards. A collaborative benchmark, as already established in biomedical and industrial thermography [35,36], would foster reference datasets for model transferability. The dataset used in this study includes synchronised meteorological parameters (ambient temperature, humidity, wind speed, irradiance) and data acquired with a uniformly calibrated FLIR SC660 thermal camera, ensuring quantitative consistency between acquisitions. Finally, although networks such as U-Net++, DeepLabv3+ and SegFormer achieve high accuracy, they require GPU processing that is not suitable for real-time field applications. Furthermore, Transformer-based models rely on large backbones and datasets, making them impractical for UAV PV inspections, which require compact and energy-efficient architectures capable of on-board inference on low-power devices [37].
As highlighted in [38,39], current research still lacks lightweight architectures specifically optimised for PV thermographic diagnostics. The proposed model, called Efficient Attentive U-Net, fills this gap by integrating a MobileNetV2 encoder with Squeeze-and-Excitation (SE) blocks, Attention Gates (AG) and an ASPP bottleneck, combining multi-scale contextual perception with computational compactness. This balance between accuracy and efficiency allows the architecture to be deployed in real-time applications on drones. The paper links diagnostic accuracy directly to energy loss levels through a power-focused labelling scheme, reporting both visual measures (IoU/F1 score) and approximate energy impact. Robustness to domain shift (different days, observation angles and irradiance conditions) is analysed through an ablation study on pre-processing. The main novelty lies in the direct link between diagnostic results and energy yield, which is still underdeveloped in the literature [15,18,20,25], representing a step forward towards energy-conscious operation and maintenance management. This methodological approach aligns with other multidisciplinary efforts that integrate physical modelling and AI for system health monitoring and optimisation. Overall, although recent works have significantly improved PV thermogram segmentation using U-Net variants, Mask R-CNN, or transformer-based models, current research still suffers from three major limitations: the scarcity of standardised, radiometrically consistent datasets; the strong sensitivity of models to site-specific environmental conditions; and, above all, the lack of a quantitative and reproducible link between detected thermal anomalies and actual energy yield losses. The objective of this work is therefore twofold: (i) to improve segmentation accuracy and (ii) to associate defect detection with a quantitative estimation of power losses, aligning thermographic diagnostics with predictive maintenance strategies for PV systems. The most effective approaches will be those that balance high segmentation accuracy with the efficiency and versatility needed to pinpoint micro-cracks and hot spots. The evaluation pipeline must ensure fair comparisons between models, avoid data leakage and provide statistically consistent performance estimates, reporting both accuracy and complexity indicators to assess readiness for deployment on UAV and IoT systems.
In summary, this work addresses three main gaps identified in recent literature on PV diagnostics based on infrared thermography (IRT): (i) the absence of lightweight segmentation architectures explicitly optimised for use on UAVs and edge devices; (ii) the lack of a quantitative, energy-oriented labelling scheme capable of linking thermal anomalies to corresponding power loss ranges; and (iii) the limited use of statistically rigorous validation protocols on small domain-specific datasets. To this end, the proposed Efficient Attentive U-Net achieves architectural balance thanks to a MobileNetV2 encoder enhanced by AG, SE and ASPP modules, combining compactness and representational capacity. The method is validated on a dedicated thermographic dataset expanded to increase variability, with pre-processing that includes resizing, contrast enhancement, denoising and normalisation. Hyperparameters and generalisation are evaluated using nested cross-validation. The model is compared with the most advanced architectures, and performance is evaluated through pixel accuracy, mIoU, macro F1-score and computational metrics (number of parameters, FLOPs, inference time). The results confirm that the proposed network achieves high accuracy despite its lightweight design, outperforming more complex counterparts and demonstrating strong potential for real-time UAV PV inspections and smart energy infrastructure.
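As a concrete illustration of the Squeeze-and-Excitation mechanism used in the encoder, the following minimal NumPy sketch shows how a feature map is reweighted channel-wise (global pooling, bottleneck MLP, sigmoid gate). The weights `w1`/`w2`, the reduction ratio and the feature-map size are placeholders, not the trained parameters of the proposed network.

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Illustrative Squeeze-and-Excitation reweighting of a (H, W, C) feature map."""
    # Squeeze: global average pooling over spatial dimensions -> (C,)
    z = feature_map.mean(axis=(0, 1))
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate per channel
    s = np.maximum(z @ w1, 0.0)               # reduce channels (C -> C/r)
    gate = 1.0 / (1.0 + np.exp(-(s @ w2)))    # restore channels, squash to (0, 1)
    # Scale: reweight each channel of the original map
    return feature_map * gate

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1       # reduction ratio r = 4 (illustrative)
w2 = rng.standard_normal((4, 16)) * 0.1
out = squeeze_excite(fmap, w1, w2)
print(out.shape)  # (8, 8, 16)
```

Because the gate lies in (0, 1), each channel of the output is a damped copy of the input, which is how SE suppresses uninformative channels at negligible computational cost.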
Figure 1 illustrates the workflow of the proposed methodology.
The rest of our paper is structured as follows.
Section 2 details the technical setup of IR acquisition, dataset properties and pre-processing procedures, the nested cross-validation plan, our Efficient Attentive U-Net architecture, and the training protocol.
Section 3 presents the experimental results together with a systematic analysis covering quantitative metrics, per-class results, ablation studies, qualitative comparisons, accuracy–efficiency trade-offs, methodological strengths, and limitations. Finally,
Section 4 concludes our paper by summarising our major contributions and outlining possible directions for future study.
The main contributions of this work can be summarised as follows: (i) a lightweight Efficient Attentive U-Net architecture is designed for infrared thermography applied to PV systems, combining a MobileNetV2 encoder with Attention Gates, Squeeze-and-Excitation blocks and an ASPP bottleneck, achieving a favourable balance between segmentation accuracy and computational efficiency; (ii) an energy-oriented labelling scheme is proposed that associates different thermographic defect classes with their estimated power loss ranges, based on the guidelines of IEC TS 62446-3, thermo-electric efficiency models and cross-checks with I–V measurements; (iii) a rigorous 5 × 2 nested cross-validation protocol is adopted, with augmentation strategies designed to avoid any form of leakage, in order to obtain statistically reliable performance estimates on a small but carefully curated PV thermographic dataset.
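The outer loop of the 5 × 2 cross-validation protocol mentioned in contribution (iii) can be sketched as follows. This is an illustrative split generator only (the inner hyperparameter loop is omitted), and the comment on grouping reflects the leakage-avoidance requirement stated above.

```python
import numpy as np

def five_by_two_cv_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for a 5x2 cross-validation protocol:
    5 repetitions, each splitting the data into two disjoint halves.
    To avoid leakage, each index should represent an original image together
    with all of its augmented copies, assigned to a fold as a single unit."""
    rng = np.random.default_rng(seed)
    for _ in range(5):
        perm = rng.permutation(n_samples)
        half = n_samples // 2
        a, b = perm[:half], perm[half:]
        yield a, b   # fold 1: train on a, test on b
        yield b, a   # fold 2: train on b, test on a

splits = list(five_by_two_cv_splits(100))
print(len(splits))  # 10 train/test pairs
```

Each of the 10 resulting train/test pairs uses every sample exactly once, which is what makes 5 × 2 CV statistically convenient for paired model comparisons on small datasets.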
2. Materials and Methods
In order to develop a reliable diagnostic framework for detecting PV faults, this section describes the experimental campaign and the methodological steps adopted in this work. The research focuses on two main objectives. The first is to define a rigorous method for collecting IR images of the PV systems under investigation in operating conditions, in compliance with international standards, capturing the most frequent classes of anomalies that negatively affect energy yield. The second is to design and validate a DL pipeline capable of locating and classifying such faults with high accuracy. The methodology begins with a detailed description of the PV system under analysis and the configuration of the thermographic measurement, in order to ensure the reproducibility of the acquisition conditions. A detailed explanation of the fault categorisation procedure and the image annotations then links each type of defect to the associated energy losses. The section also describes the pre-processing steps, data augmentation strategies, and dataset splitting, which are essential for improving the robustness of the model under different environmental conditions. The final part introduces the adopted U-Net architecture, training configuration, and validation metrics, selected to evaluate not only classification performance but also the potential energy implications of the diagnosis.
Together, the thermographic monitoring and the energy performance analysis described here provide a valid and relevant diagnostic methodology for predictive maintenance in renewable energy systems.
2.1. PV System Description
The experimental activities were conducted on the grid-connected PV system located at the DICEAM Department of the Mediterranean University of Reggio Calabria, in Via Rodolfo Zehender (Figure 2).
The PV field is located on natural terrain, characterised by a slight slope that facilitates rainwater drainage and minimises the risk of stagnation. The panels are south-facing, with an inclination optimised for the latitude of the site (approximately 38°), in order to maximise solar radiation throughout the year. The absence of buildings or significant obstacles in the immediate vicinity ensures almost constant sun exposure, minimising the possibility of accidental shading during daylight hours. The geographical location of the plant can be identified by the following coordinates: 38.1093° N, 15.6430° E. This logistical configuration was selected to ensure stable and representative operating conditions, which are essential for collecting reliable thermographic data as part of the measurement campaign. The system, selected as a case study, consists of 18 SunPower X Series panels (X22-360-COM) [40] and has the characteristics shown in Table 1.
This installation was chosen because it represents a typical medium-sized PV system in which fault detection has a direct impact on overall energy yield and operational reliability. In particular, the system had already shown signs of performance degradation during preliminary inspections, making it a suitable candidate for testing thermographic and artificial intelligence-based diagnostic methodologies.
The normal operation of the system allowed the acquisition of thermographic data under load, revealing temperature anomalies associated with defects such as hot spots, bypass diode failures or cell interconnection problems. The measurement campaign, performed on the grid-connected PV system under load conditions, ensured realistic data acquisition directly linked to actual energy losses. In fact, several studies have shown that thermal signatures are often attenuated or even absent when modules are tested in open circuit, while load conditions generate current imbalances that amplify localised heating [41,42]. This approach therefore ensures that the recorded anomalies are not only visible from a thermal point of view, but also directly related to actual energy losses, making the dataset particularly suitable for evaluating predictive maintenance strategies in terms of energy efficiency. Additionally, the measured current–voltage (I–V) characteristics were periodically recorded to ensure that detected thermal anomalies corresponded to actual electrical performance deviations.
2.2. IR Acquisition Setup
The thermographic survey stems from the need to collect high-quality infrared images of PV modules under realistic operating conditions, with the aim of capturing both normal and abnormal behaviour. The measurements were taken using a FLIR SC660 (FLIR Systems Inc.©, Wilsonville, OR, USA) [43] radiometric infrared camera with a thermal sensitivity below 50 mK, a spectral response in the 7.5–14 µm range and a 640 × 480 pixel uncooled microbolometer detector. The uncooled sensor ensures stability in environmental measurements, while the dual recording mode (radiometric and JPEG) allows for quantitative thermographic analysis. NIST-traceable calibration ensures adherence to metrological standards. These specifications ensured the ability to detect small temperature gradients, which are essential for identifying early-stage hot spots or subtle interconnection defects. Monitoring was conducted during periods of stable irradiation, typically around midday, when the modules were operating at or near maximum power. Environmental conditions, including solar radiation, air temperature, and wind speed, were monitored simultaneously using a pyranometer and a weather station to ensure consistent acquisition conditions. Measurements were taken with the PV system under electrical load, as thermal anomalies are most evident when current flows through defective cells or bypass diodes. The camera was placed 3–5 m from the modules and angled to minimise reflections and ensure full field coverage. The thermographic inspection campaign was performed following the principal international requirements for PV field testing. Specifically, IEC TS 62446-3 prescribes the requirements for infrared thermographic measurements on PV modules and arrays under field operating conditions, ensuring reproducibility and comparability of results. The additional guidelines of IEC 61215 (design qualification and type approval of crystalline silicon modules) [44], IEC 61730 (PV module safety qualification) [45], and UNI EN ISO 9712 (qualification of non-destructive testing personnel) were also taken into consideration to ensure technical reliability and operator conformity [46].
According to IEC TS 62446-3, thermographic inspections should be performed under clear-sky conditions. The modules should receive direct irradiance of at least 600 W/m² (preferably above 700 W/m²) to produce a sufficient temperature difference between defective and healthy cells. Wind speed should remain below 4–5 m/s, since convective cooling attenuates the temperature differences on which fault detection relies. Ambient temperature and humidity must be monitored continuously, as both affect atmospheric infrared transmission and calibration accuracy. The thermographic inspection was carried out using the traditional manual method, walking through the PV site and inspecting each module with the thermal imaging camera. The thermographic acquisitions were carried out over a period of two months, from mid-April to mid-June 2025.
Table 2 defines the statistical distribution of the main environmental parameters recorded during the infrared thermographic campaign, compliant with the IEC TS 62446-3 standard.
All thermograms were acquired using a fixed emissivity value of 0.90 for crystalline silicon modules, with radiometric calibration performed using the camera's internal non-uniformity correction (NUC) and periodic drift compensation. During acquisition, the environmental parameters required by IEC TS 62446-3 were monitored, namely irradiance, ambient temperature, relative humidity, and wind speed, to ensure compliant inspection conditions (G ≥ 600 W/m², wind speed < 5 m/s, RH < 75%). These measures ensured adequate radiometric consistency and reduced environmental variability within the dataset.
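The environmental gating just described can be expressed as a simple validity check applied to each acquisition session. The thresholds mirror those reported above (G ≥ 600 W/m², wind speed < 5 m/s, RH < 75%); the function name and interface are illustrative, not part of the actual acquisition software.

```python
def acquisition_conditions_ok(irradiance_w_m2, wind_speed_m_s, rel_humidity_pct):
    """Check the environmental thresholds used in this campaign
    (per IEC TS 62446-3): G >= 600 W/m^2, wind < 5 m/s, RH < 75%."""
    return (irradiance_w_m2 >= 600.0
            and wind_speed_m_s < 5.0
            and rel_humidity_pct < 75.0)

print(acquisition_conditions_ok(850, 2.1, 60))  # True: compliant conditions
print(acquisition_conditions_ok(450, 2.1, 60))  # False: irradiance too low
```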
In total, 500 thermographic images were collected on different days and under different environmental conditions. The raw thermographic data were stored in radiometric format, preserving pixel-level temperature information. From the collected thermograms, 100 original images were manually selected to ensure a balanced representation of the six annotated classes (normal, mild hotspot, severe hotspot, bypass, background, shadow) under irradiance between 800 and 1000 W/m². The remaining 400 samples were obtained via geometric and radiometric augmentation to expand variability while maintaining physical consistency.
This was essential for subsequent pre-processing and normalization, as well as for accurate annotation of fault regions. Representative examples of the acquired thermal images are shown in Figure 3, where typical fault patterns such as hotspots, shading effects, and soiling are clearly visible.
2.3. Dataset
The dataset considered in the present work consists of infrared (IR) thermographic images of PV modules captured through a focused inspection campaign. The images were acquired under actual operating conditions, ensuring that the environmental conditions and irradiance profiles truly represented typical field situations. From the collected thermograms, a total of 100 images were selected as representative samples covering different defect types as well as normal operating modes. All images were captured at the native resolution of 640 × 480 pixels, in pseudo-RGB format, in which thermal intensity values were mapped onto a three-channel colormap for both visualisation and subsequent processing.
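A minimal sketch of the pseudo-RGB mapping is given below: a single-channel thermal frame is normalised and spread across three channels. The linear "hot"-style ramp is an assumption for illustration only; the camera's actual palette differs.

```python
import numpy as np

def to_pseudo_rgb(thermal, t_min=None, t_max=None):
    """Map a single-channel thermal frame to a three-channel pseudo-RGB image.
    A simple 'hot'-style ramp is assumed; the real FLIR palette differs."""
    t_min = thermal.min() if t_min is None else t_min
    t_max = thermal.max() if t_max is None else t_max
    x = np.clip((thermal - t_min) / (t_max - t_min + 1e-9), 0.0, 1.0)
    r = np.clip(3.0 * x, 0.0, 1.0)           # red ramps up first
    g = np.clip(3.0 * x - 1.0, 0.0, 1.0)     # then green
    b = np.clip(3.0 * x - 2.0, 0.0, 1.0)     # then blue (hottest pixels -> white)
    return np.stack([r, g, b], axis=-1)

frame = np.linspace(20.0, 80.0, 640 * 480).reshape(480, 640)  # synthetic frame, degC
rgb = to_pseudo_rgb(frame)
print(rgb.shape)  # (480, 640, 3)
```

Fixing `t_min`/`t_max` across a whole acquisition session, rather than per image, keeps the colour of a given temperature consistent between frames.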
2.3.1. Data Annotation
The dataset was annotated on a pixel-wise basis, assigning each pixel to one of six predefined classes relevant for PV fault detection: (1) background/normal module, (2) hotspot, (3) bypass diode fault, (4) substring fault, (5) soiling/shading, and (6) glass damage. The annotation process was performed manually with expert supervision to ensure accuracy and consistency across images, following established guidelines in thermographic PV diagnostics. The thermographic images collected during inspections of PV systems showed anomalies on specific cells. For this reason, each of these images was subsequently associated with a multi-coloured mask, in which the white pixels identify the cells considered anomalous. The identification of abnormal cells is based exclusively on thermal data. In fact, the operator can immediately recognise the location of the anomaly thanks to the clear deviation in the cell temperature compared to the surrounding cells. Based on evidence provided by scientific literature and field inspections, five representative categories of defects were considered, as shown in Table 3.
Each of these types of faults has a direct and quantifiable impact on the energy yield of the affected module string, ranging from slight reductions in efficiency to severe power losses of more than 70% in the case of extensive hot spots or defective bypass diodes.
The central role of thermography in identifying defects in PV modules allows for the rapid detection of hotspots, diode failures, and damaged cells directly in the field, facilitating more efficient maintenance [47,48]. The use of machine learning algorithms, combined with thermal images captured by drones, improves the automatic classification of defects, provided that the dataset is well annotated [49,50]. However, the reliability of thermographic analysis can be influenced by environmental variables and, in some cases, additional checks are necessary to avoid false positives [51]. For these scientific reasons, the set of thermographic data collected during the measurement campaign was annotated and a pixel-level segmentation approach was adopted, in which defective regions were delineated using polygonal masks on radiometric thermal images. This annotation strategy ensured the generation of ground-truth segmentation masks corresponding to each defect category, providing the basis for multi-class semantic segmentation. In addition to defective areas, areas corresponding to intact cells were also annotated, allowing the model to learn a balanced distinction between anomalies and normal operating conditions.
The masks were initially annotated by a specialist in PV thermography and subsequently reviewed by a second researcher with experience in infrared diagnostics. Ambiguous contours were resolved jointly by consensus. Although this two-step process ensured consistency, no formal metric of agreement between annotators was calculated, an issue that will be addressed in future extensions of the dataset.
The acquired thermographic images were initially saved in JPG format, and the 100 most significant images were then loaded into LabelMe© software (version 5.1.1), which was used to perform manual semantic segmentation. Since raw thermal images are affected by variations in ambient temperature, irradiance levels and acquisition geometry, radiometric normalisation was applied first [52]. For each image, polygons were drawn to identify and delimit areas corresponding to potential faults or thermal defects, assigning each polygon a semantic label describing the type of defect observed. LabelMe© then saved this information in JSON annotation files, which contain the geometric data of the traced polygons, the assigned labels and references to the source image [53]. This process transformed the visual information contained in the JPG images into a structured, machine-readable format, which is essential for subsequent statistical analysis, automated processing, and training of defect classification algorithms.
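The structure of the resulting LabelMe JSON files can be parsed as sketched below, turning each annotated polygon into a (class ID, points) pair ready for mask rasterisation. The label-to-class mapping is hypothetical; the actual label strings used in the project may differ.

```python
import json

# Hypothetical mapping from annotation labels to class IDs; the actual
# label names used in the study's LabelMe project may differ.
CLASS_IDS = {"background": 0, "hotspot": 1, "bypass_diode": 2,
             "substring": 3, "soiling_shading": 4, "glass_damage": 5}

def parse_labelme(json_text):
    """Extract (class_id, polygon_points) pairs from a LabelMe-style JSON file."""
    data = json.loads(json_text)
    regions = []
    for shape in data.get("shapes", []):
        label = shape["label"]
        if label in CLASS_IDS and shape.get("shape_type", "polygon") == "polygon":
            regions.append((CLASS_IDS[label], shape["points"]))
    return regions

# Synthetic example mimicking LabelMe's output structure
example = json.dumps({
    "imagePath": "module_017.jpg",
    "shapes": [{"label": "hotspot", "shape_type": "polygon",
                "points": [[120, 85], [140, 85], [140, 110], [120, 110]]}]
})
print(parse_labelme(example))  # [(1, [[120, 85], [140, 85], [140, 110], [120, 110]])]
```

Each extracted polygon can then be rasterised (e.g. with an image library's polygon-fill routine) to produce the per-pixel ground-truth masks described above.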
The images in Figure 4 show some examples of manual segmentation performed with LabelMe© software. They show polygons drawn on areas of the PV module that have potential thermal defects, such as hotspots, soiling, shadows or diode malfunctions. Each polygon is associated with a specific semantic label, which allows the different types of anomalies present in the data to be systematically classified.
Although manual annotation yields accurate, customised masks for each image, the segmentation of thermographic images requires a significant investment of time and resources, especially when working with large datasets [54]. Segmenting thermal defects is challenging because blurred contours and artefacts can mislead even expert operators. Radiometric normalisation reduced the influence of environmental fluctuations and ensured that thermal contrasts due to defects remained comparable across acquisition sessions. By explicitly linking each annotated defect to its potential energy impact, the dataset provides not only a solid training basis for the proposed DL model, but also a reference framework for evaluating diagnostic accuracy.
2.3.2. Class Distribution
Each thermogram was manually annotated on a pixel-wise basis using six semantic classes representing the main categories of PV defects: background, hotspot, bypass diode fault, substring fault, soiling/shading, and glass damage. The annotation process was performed under expert supervision to ensure consistency across images.
Table 4 summarizes the class distribution in the original dataset of 100 thermographic images. The statistics report the total number of pixel occurrences per class and their relative percentage with respect to the total dataset.
Although the data augmentation process (Section 2.3.3) expanded the dataset to 400 images, it preserved the same class proportions, ensuring a balanced distribution among training, validation, and test folds.
As expected, background pixels dominate the dataset, accounting for 85.3% of all pixels.
To establish a quantitative correlation between thermal anomalies and energy yield loss, each defect class was assigned an estimated power loss range, derived from literature data, experimental I–V curve measurements and the ΔT thermographic criterion defined in IEC TS 62446-3. The correspondence between thermographic defect classes and their associated power-loss ranges is summarised in Table 5.
These values are consistent with experimental data obtained on crystalline silicon modules under load conditions [41,42,43] and with the quantitative ranges reported in previous thermographic studies and laboratory analyses.
For example, mild hotspots, typically caused by micro-welding defects or partial shading, result in limited efficiency reductions (approximately 1–3%), while severe hotspots or defective interconnections between cells can cause losses of up to 20%.
Bypass diode failures produce current misalignments at the string level, with power reductions of around 30–35%.
Surface phenomena such as soiling or shading generate variable effects (approximately 10–30%), depending on the extent and optical density of the obstruction, while glass breakage or delamination can lead to irreversible degradation of more than 40–50%.
This mapping allows each thermographic class (C0–C3) to be interpreted not only as a visual defect category, but also as an indicator of the module’s potential energy degradation, thus introducing an energy dimension into the evaluation of segmentation results.
The proposed power loss ranges are approximate estimates, not direct electrical measurements. They are derived from a combination of radiometric criteria based on the temperature differences reported in IEC TS 62446-3, typical performance degradation values found in the literature for crystalline silicon modules, and empirical observations obtained from the I–V characterisation of defective modules. Consequently, these estimates are affected by several sources of uncertainty, including irradiance variability, emissivity assumptions, local ventilation conditions, defect morphology, and thermal diffusion effects within the laminate. For this reason, classes C0 to C3 should be interpreted as categories indicative of energy impact, useful for maintenance prioritisation, and not as quantitative estimates of instantaneous power loss.
Classes C4 and C5, on the other hand, are reserved for elements not directly related to an electrical defect, such as the background of the image or shadow areas produced by frames and support structures.
It is important to note that IEC TS 62446-3 does not provide fixed power loss values for each defect category, but defines severity classes based on ΔT, which correlate with the electrical stress experienced by PV cells and substrings. The power loss ranges reported in
Table 5 were therefore derived as conservative and physically justified estimates, obtained by combining (i) the ΔT thresholds defined in IEC TS 62446-3; (ii) the temperature coefficient of the SunPower X-Series modules used in the study (−0.29%/°C), which allows the reduction in Pmax to be estimated as a function of operating temperature; (iii) the thermo-electrical relationships linking localised overheating to increased series resistance and mismatch losses; and (iv) the ranges reported in previous experimental studies on crystalline silicon modules operating under load conditions.
These ranges should not be interpreted as deterministic measurements, but rather as robust energy labels that allow the proposed segmentation model to provide an interpretable, power-aware diagnostic output. A future extension of this work will include a systematic acquisition of I–V curves, cross-validated against the thermographic classes, to further refine the proposed mapping.
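As a first-order illustration of criterion (ii) above, the module temperature coefficient can translate a hotspot ΔT into an approximate Pmax reduction. The sketch below is illustrative only; the ΔT value and the function name are assumptions, not measured data.

```python
# Hypothetical worked example: first-order Pmax reduction from a hotspot ΔT,
# using the temperature coefficient cited in the text (−0.29%/°C for the
# SunPower X-Series modules). The ΔT value is illustrative, not measured.

TEMP_COEFF_PMAX = -0.29  # %/°C, from the text


def estimated_pmax_loss_pct(delta_t_celsius: float) -> float:
    """First-order estimate of the percentage Pmax reduction for a
    localized temperature rise of delta_t_celsius above normal operation."""
    return abs(TEMP_COEFF_PMAX) * delta_t_celsius


# A mild hotspot ~10 °C above its neighbours maps to roughly a 2.9% loss,
# consistent with the low end of the reported C0/C1 ranges.
loss = estimated_pmax_loss_pct(10.0)
```

This linear mapping is only a first-order approximation: it ignores series-resistance and mismatch effects, which is why the text treats the ranges as indicative labels rather than predictions.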
2.3.3. Pre-Processing and Data Augmentation
To enable model training using architectures with pretrained backbones, all images were resized to 224 × 224 pixels, a standard input size compatible with lightweight CNN encoders such as MobileNetV2 and with downsampling and upsampling operations, as it is a multiple of 32 [
55,
56]. While 224 × 224 served as the baseline resolution, additional experiments were conducted at 256 × 256 to evaluate the impact of higher spatial resolution on the detection of small-scale defects. In both cases, the original three-channel format was preserved to fully exploit the pseudo-RGB thermal representation.
Prior to training, all images underwent a pre-processing pipeline designed to enhance the visibility of defects and to normalize the input data for robust learning. A denoising filter was first applied to reduce sensor-induced noise while preserving structural details of the module surface [
57]. This was followed by pixel intensity normalization to standardize the dynamic range of thermal values across the dataset, ensuring consistency among different acquisitions and operating conditions. The normalization step is particularly important for thermographic data, as it reduces the influence of varying irradiance or ambient temperature on the model’s predictions [
58].
To increase the dataset size and improve variability, data augmentation techniques were applied, resulting in an expanded dataset of 400 images. Standard geometric transformations included random horizontal and vertical flips, as well as rotations at fixed angles (90°, 180°, 270°) to simulate different module orientations. Additional augmentations were applied to reproduce realistic variations in thermographic inspections:
- Brightness and contrast adjustments, emulating changes in irradiance, sensor calibration, or module emissivity.
- Gaussian noise injection, simulating thermal sensor disturbances.
- Random erasing and partial occlusions, reproducing effects of soiling, shading, or dirt accumulation on the module surface.
- Elastic and affine transformations with small intensity, introducing geometric variability while avoiding distortion of fine structural details.
All augmentations were applied with controlled probabilities to maintain a balance between dataset diversity and fidelity to real operating conditions.
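A minimal NumPy-only sketch of such a pipeline is given below. The probabilities, jitter ranges, and function names are illustrative assumptions, not the authors' implementation; geometric transforms are applied jointly to image and mask so the labels remain valid, while photometric transforms touch the image only.

```python
import numpy as np

# Illustrative augmentation sketch (not the authors' actual code).
# Geometric transforms act on image AND mask; photometric ones on the image only.


def augment(image, mask, rng):
    # Random horizontal / vertical flips (applied to both image and mask).
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]
    # Rotation by a fixed multiple of 90 degrees, simulating module orientation.
    k = rng.integers(0, 4)
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    # Brightness/contrast jitter (image only), emulating irradiance changes.
    alpha = 1.0 + rng.uniform(-0.2, 0.2)   # contrast factor (assumed range)
    beta = rng.uniform(-0.1, 0.1)          # brightness shift (assumed range)
    image = np.clip(image * alpha + beta, 0.0, 1.0)
    # Gaussian noise injection, simulating thermal sensor disturbances.
    image = np.clip(image + rng.normal(0.0, 0.02, image.shape), 0.0, 1.0)
    return image, mask


rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))            # pseudo-RGB thermal image in [0, 1]
msk = rng.integers(0, 6, (224, 224))       # six-class label mask (C0–C5)
aug_img, aug_msk = augment(img, msk, rng)
```

Random erasing and elastic deformations would follow the same pattern; they are omitted here to keep the sketch short.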
To preserve the integrity of model evaluation, a strict anti-leakage policy was enforced. Dataset partitioning into training, validation, and test sets was performed on the original 100 images before augmentation. Consequently, all augmented variants derived from a single original thermogram were assigned exclusively to the same fold, preventing data leakage across different subsets. This strategy ensured that the test sets only contained unseen images, thereby providing a reliable measure of model generalization.
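The anti-leakage policy can be sketched as a group-aware assignment in which every augmented variant inherits the fold of its source thermogram. The helper below is a simplified illustration (round-robin fold assignment, four samples per original: the original plus three variants); the real partitioning is stratified as described above.

```python
# Illustrative sketch of the anti-leakage partitioning: split on the 100
# original thermogram IDs first, then let every augmented variant inherit
# the fold of its source image. Round-robin assignment and the "orig/augK"
# naming are assumptions made for the sketch.


def split_without_leakage(n_originals=100, n_folds=5):
    # Assign each original image ID to one outer fold.
    fold_of_original = {i: i % n_folds for i in range(n_originals)}
    # Each sample "orig_id/augK" (aug0 = the original itself, aug1..3 its
    # variants, giving 400 images in total) inherits its source's fold.
    fold_of_sample = {}
    for orig in range(n_originals):
        for aug in range(4):
            fold_of_sample[f"{orig}/aug{aug}"] = fold_of_original[orig]
    return fold_of_sample


folds = split_without_leakage()
```

Because all variants of one thermogram share a fold, no augmented copy of a training image can ever appear in a test split.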
Figure 5 shows an example of data augmentation techniques applied on a PV thermal image.
The resulting dataset, comprising the augmented and annotated thermograms, served as the foundation for all subsequent training and evaluation.
2.4. Proposed DL Architecture: Efficient Attentive U-Net
The proposed DL model, referred to as Efficient Attentive U-Net, was designed to achieve a favourable trade-off between segmentation accuracy and computational efficiency, with a particular focus on enabling real-time deployment in UAV-based inspections and edge computing scenarios. The architecture builds upon the standard U-Net encoder–decoder structure while integrating several improvements that enhance feature representation, attention to relevant regions, and multi-scale context aggregation.
At its core, the encoder is based on MobileNetV2, a lightweight convolutional neural network pretrained on ImageNet. MobileNetV2 employs depthwise separable convolutions and inverted residual blocks, which drastically reduce the number of parameters and floating-point operations (FLOPs) while preserving the ability to extract discriminative features. This choice makes the architecture significantly more compact compared to encoders such as VGG or ResNet, and therefore more suitable for real-time inference on resource-constrained hardware. To further improve the model’s capability to detect small and irregular anomalies in PV modules, AGs were integrated along the skip connections between encoder and decoder. These gates learn to suppress irrelevant activations from the background and to highlight thermally anomalous regions such as hotspots or cracks. The attention mechanism ensures that the decoder focuses on features most relevant to the segmentation task, thereby reducing false positives in large uniform background areas. In addition, SE blocks were incorporated in the skip pathways to adaptively recalibrate channel-wise feature responses. By emphasizing informative channels and suppressing redundant ones, the SE mechanism enhances the discriminative power of the feature maps passed to the decoder. This is particularly useful in thermographic images, where subtle variations in intensity may carry significant diagnostic information.
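A compact NumPy sketch of the SE recalibration step is shown below. The weight shapes, reduction ratio (r = 4), and toy feature map are assumptions made for illustration, not the network's actual dimensions.

```python
import numpy as np

# Illustrative Squeeze-and-Excitation block: global average pooling ("squeeze"),
# a two-layer bottleneck MLP with sigmoid gate ("excitation"), and per-channel
# rescaling of the feature map. Shapes and r=4 are assumptions for the sketch.


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(feat, w1, w2):
    """feat: (H, W, C) feature map; w1: (C, C//r); w2: (C//r, C)."""
    # Squeeze: one descriptor per channel via global average pooling.
    z = feat.mean(axis=(0, 1))                  # (C,)
    # Excitation: bottleneck MLP + sigmoid yields per-channel gates in (0, 1).
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)   # (C,)
    # Recalibrate: scale each channel of the feature map by its gate.
    return feat * s


rng = np.random.default_rng(0)
C = 8                                           # toy channel count
feat = rng.random((14, 14, C))
w1 = rng.normal(0.0, 0.1, (C, C // 4))          # reduction ratio r = 4 (assumed)
w2 = rng.normal(0.0, 0.1, (C // 4, C))
out = se_block(feat, w1, w2)
```

Because the gate is a sigmoid, each channel is attenuated rather than amplified, which is how redundant channels get suppressed before the skip features reach the decoder.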
The bottleneck stage of the network includes an ASPP module, which applies parallel dilated convolutions with different dilation rates. This design allows the model to capture both fine-grained details and larger contextual patterns, enabling robust segmentation of defects that vary in size and shape. The ASPP module enriches the receptive field of the bottleneck without substantially increasing the computational burden.
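The receptive-field benefit of dilated convolutions can be quantified with a one-line formula: a k × k kernel with dilation d spans k + (k − 1)(d − 1) pixels per axis while keeping only k² weights per channel. The dilation rates below (1, 6, 12, 18, as in DeepLab-style ASPP) are an assumption, since the paper does not list its exact rates.

```python
# Why ASPP enlarges the receptive field cheaply: effective extent of a dilated
# 3x3 convolution. The rates (1, 6, 12, 18) are assumed DeepLab-style values.


def effective_receptive_field(kernel=3, dilation=1):
    # A k x k kernel with dilation d covers k + (k - 1) * (d - 1) pixels.
    return kernel + (kernel - 1) * (dilation - 1)


fields = {d: effective_receptive_field(3, d) for d in (1, 6, 12, 18)}
```

Running the parallel branches and concatenating their outputs gives the bottleneck simultaneous access to 3-, 13-, 25-, and 37-pixel contexts at the cost of four 3 × 3 convolutions.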
The decoder reconstructs the full-resolution segmentation map by progressively upsampling the encoded features. Skip connections from the encoder, refined through AGs and SE blocks, are concatenated with the decoder features at corresponding resolutions. Convolutional layers with reduced filter sizes are used in the decoder to maintain a lightweight structure. The final output layer applies a softmax activation function to produce a dense pixel-wise classification into the six predefined classes.
A detailed description of the network configuration is provided in
Table 6, which summarizes the main layers, their output dimensions, kernel size, stride, and padding, assuming an input resolution of 224 × 224 × 3.
The overall architecture is illustrated in
Figure 6, which schematically represents the MobileNetV2 encoder, the attention-enhanced skip connections with SE blocks, the ASPP bottleneck, and the lightweight decoder that reconstructs the final segmentation masks.
2.5. Validation Protocol and Training Setup
The study adopts a rigorous validation and training protocol to ensure that performance is measured fairly and reproducibly. Given the relatively small dataset, a nested 5 × 2 cross-validation (CV) framework was used, which provides robust hyperparameter selection together with a reliable out-of-sample performance estimate.
In the outer loop, the dataset was divided into five stratified folds, each containing a representative mixture of the six labelled classes. During each iteration, the model was trained on four folds (80% of the data), while the remaining fold (20%) was held out as a separate test set. The process was repeated five times so that each sample was tested exactly once. The outer-loop results were then averaged, and standard deviations were computed to quantify their dispersion.
Within each outer training partition, a 2-fold internal cross-validation was used during the hyperparameter search. The inner loop explored the following parameters:
- Learning rate (LR): {1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴};
- Weight decay (WD): {1 × 10⁻⁴, 5 × 10⁻⁵};
- Batch size: {8, 16};
- Number of epochs: up to 50 (with early stopping).
The selected configuration was the one achieving the highest mIoU on the inner folds. The outer test folds were never observed during hyperparameter optimisation, thus preventing any leakage and providing an independent performance estimate.
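The nested search can be summarised by the following plain-Python skeleton. The `evaluate` callable is a stand-in for training the network and returning a validation mIoU; the grid mirrors the search space listed above, and the toy example at the end is purely illustrative.

```python
from itertools import product

# Skeleton of the nested 5x2 cross-validation: the outer 5-fold split estimates
# out-of-sample performance; the inner 2-fold split selects hyperparameters.
# `evaluate(cfg, train_ids, val_ids)` is a stand-in for training the model and
# returning a validation mIoU. The grid mirrors the search space in the text.

GRID = list(product([1e-3, 5e-4, 1e-4],   # learning rate
                    [1e-4, 5e-5],         # weight decay
                    [8, 16]))             # batch size


def nested_cv(sample_ids, evaluate, n_outer=5, n_inner=2):
    outer_scores = []
    for k in range(n_outer):
        test = [s for i, s in enumerate(sample_ids) if i % n_outer == k]
        train = [s for s in sample_ids if s not in test]
        # Inner loop: pick the config with the best mean inner-fold score.
        best = max(GRID, key=lambda cfg: sum(
            evaluate(cfg,
                     [s for i, s in enumerate(train) if i % n_inner != j],
                     [s for i, s in enumerate(train) if i % n_inner == j])
            for j in range(n_inner)) / n_inner)
        # The outer test fold is touched exactly once, with the chosen config.
        outer_scores.append(evaluate(best, train, test))
    return outer_scores


# Toy evaluate that favours lr = 5e-4 regardless of data, to make the
# skeleton runnable; a real evaluate would train the network.
scores = nested_cv(list(range(100)),
                   lambda cfg, tr, va: 1.0 - abs(cfg[0] - 5e-4))
```

The key property is visible in the structure: `test` never enters the `max(...)` model-selection step, which is exactly the leakage guarantee described above.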
For training, the Adam optimiser was used, with exponential decay rates β1 = 0.9 and β2 = 0.999. The initial learning rate was selected by the inner-loop optimisation and dynamically adjusted by a Reduce-on-Plateau scheduler, which reduced the learning rate by a factor of 0.1 whenever the validation loss showed no improvement for five epochs.
To prevent overfitting, an early stopping rule terminated training when no improvement was observed for ten consecutive epochs; training converged after about 30 epochs on average.
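The two rules combine as in the following minimal sketch, which is illustrative rather than the actual training loop (thresholds follow the text: factor 0.1 and patience 5 for the scheduler, patience 10 for early stopping).

```python
# Illustrative combination of Reduce-on-Plateau and early stopping.
# Two separate "bad epoch" counters are kept: the scheduler counter resets
# after each reduction, the stopping counter only resets on improvement.


def run_schedule(val_losses, lr=1e-3, factor=0.1, lr_patience=5,
                 stop_patience=10):
    best = float("inf")
    bad_lr = bad_stop = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - 1e-12:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
            if bad_lr >= lr_patience:
                lr *= factor          # Reduce-on-Plateau step
                bad_lr = 0
            if bad_stop >= stop_patience:
                return epoch + 1, lr  # early stop
    return len(val_losses), lr


# A toy loss curve that improves for three epochs and then plateaus:
epochs_run, final_lr = run_schedule([0.9, 0.7, 0.5] + [0.6] * 20)
```

On this toy curve the learning rate is reduced twice (epochs 8 and 13 of the plateau) and training stops after 13 epochs, well before the 23 available.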
2.6. Loss Functions and Class Imbalance Handling
Segmenting thermal images of PV modules is challenging because background pixels far outnumber defective pixels.
To address this imbalance, a combined loss function was used, coupling Weighted Cross-Entropy (WCE) and Dice Loss. This combination benefits from both the ability of Cross-Entropy to penalise pixel-wise misclassifications and the overlap-based optimisation of Dice.
The Weighted Cross-Entropy part gives more importance to rare classes, such as glass damage or small hotspots, and less importance to the dominant background. The class weights $w_i$ were calculated using the median frequency balancing method, defined as

$$w_i = \frac{f_{\text{median}}}{f_i} \quad (1)$$

where $f_i$ is the pixel frequency of class $i$ and $f_{\text{median}}$ is the median frequency across all classes. This strategy prevents the model from being biased towards the background while ensuring that small but diagnostically important regions are not overlooked. The Dice Loss directly optimises the spatial overlap between the predicted mask $P$ and the ground truth $G$. For each class $i$, the Dice coefficient is written as

$$\text{Dice}_i = \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i} \quad (2)$$

where $TP_i$, $FP_i$, and $FN_i$ denote the numbers of true positives, false positives, and false negatives, respectively. The Dice Loss is then defined as $1 - \text{Dice}_i$, averaged across all classes. This formulation is particularly effective in handling small or fragmented defects, where overlap-based metrics are more informative than pixel accuracy.

The final loss function used for training was a linear combination of the two terms:

$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{WCE}} + (1 - \alpha)\,\mathcal{L}_{\text{Dice}} \quad (3)$$

with $\alpha$ empirically set to 0.5 to balance their contributions.
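Under the stated definitions, the full objective can be sketched in NumPy as follows. Shapes and the toy usage are illustrative; `probs` stands for per-pixel softmax outputs flattened to (N, C).

```python
import numpy as np

# Illustrative NumPy version of the combined objective: median-frequency
# class weights (Eq. 1), a soft per-class Dice (Eq. 2), and the 0.5/0.5 blend
# of Weighted Cross-Entropy and Dice Loss (Eq. 3). Shapes are toy-sized.


def median_frequency_weights(labels, n_classes):
    # f_i: pixel frequency of class i; absent classes get weight 0.
    freq = np.array([(labels == i).mean() for i in range(n_classes)])
    freq = np.where(freq > 0, freq, np.nan)
    return np.nan_to_num(np.nanmedian(freq) / freq)  # w_i = f_median / f_i


def combined_loss(probs, labels, n_classes, alpha=0.5):
    """probs: (N, C) per-pixel softmax outputs; labels: (N,) int class IDs."""
    w = median_frequency_weights(labels, n_classes)
    onehot = np.eye(n_classes)[labels]               # (N, C)
    # Weighted Cross-Entropy: rare classes count more.
    wce = -(w[labels]
            * np.log(probs[np.arange(len(labels)), labels] + 1e-8)).mean()
    # Soft Dice per class, averaged, then turned into a loss.
    inter = (probs * onehot).sum(axis=0)
    dice = (2 * inter + 1e-8) / (probs.sum(axis=0)
                                 + onehot.sum(axis=0) + 1e-8)
    return alpha * wce + (1 - alpha) * (1 - dice.mean())


# Toy check: six pixels, three classes, perfect one-hot prediction.
labels = np.array([0, 0, 0, 1, 1, 2])
perfect = np.eye(3)[labels]
```

With a perfect prediction both terms vanish (up to the ε terms), and the rarest class here (frequency 1/6 against a median of 1/3) receives weight 2, exactly as Eq. (1) prescribes.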
To further verify the model’s robustness under strong class imbalance, an additional study was conducted using the Focal Tversky loss. Although it was not included in the final training, this check provided insight into how the model responds to changes in imbalance handling.
3. Results
3.1. Evaluation Metrics
To evaluate the proposed segmentation framework, two types of measurements were used: (i) segmentation accuracy, which quantifies the quality of the predicted masks, and (ii) computational performance, which measures model efficiency and suitability for real-time use.
3.1.1. Segmentation Accuracy
The predicted masks are compared with the ground-truth annotations to count, for each class $i$, the true positives ($TP_i$), true negatives ($TN_i$), false positives ($FP_i$), and false negatives ($FN_i$).
These counts are then used to derive the Pixel Accuracy (PA) (4), the mean IoU (mIoU) (5), and the macro F1-score (6):

$$\text{PA} = \frac{\sum_{i=1}^{C} TP_i}{N} \quad (4)$$

$$\text{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i} \quad (5)$$

$$\text{F1}_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i} \quad (6)$$

where $C$ is the number of classes and $N$ the total number of pixels.
Although intuitive, PA may overestimate performance in the presence of class imbalance, as large background areas dominate the evaluation.
The mIoU averages the per-class IoU values. The F1-score accounts for both FP and FN, making it more informative than PA for small classes such as hotspots or glass damage.
In the multi-class setting, the macro F1-score averages the per-class F1 values without frequency weighting, so that every class is treated as equally important regardless of how many pixels it occupies. This property is particularly valuable under class imbalance, as it reveals whether the model can reliably identify small but diagnostically critical regions.
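The three metrics can be computed from per-class confusion counts as in this NumPy sketch (the small ε terms guard against empty classes; the toy arrays are illustrative):

```python
import numpy as np

# Illustrative computation of Pixel Accuracy (Eq. 4), mean IoU (Eq. 5), and
# macro F1 (Eq. 6) from flattened prediction and ground-truth label arrays.


def segmentation_metrics(pred, gt, n_classes):
    pa = (pred == gt).mean()                              # Pixel Accuracy
    ious, f1s = [], []
    for i in range(n_classes):
        tp = np.sum((pred == i) & (gt == i))
        fp = np.sum((pred == i) & (gt != i))
        fn = np.sum((pred != i) & (gt == i))
        ious.append(tp / (tp + fp + fn + 1e-8))           # per-class IoU
        f1s.append(2 * tp / (2 * tp + fp + fn + 1e-8))    # per-class F1
    # Macro averages: every class counts equally, regardless of pixel share.
    return pa, float(np.mean(ious)), float(np.mean(f1s))


pred = np.array([0, 0, 1, 1, 2, 2])
gt = np.array([0, 0, 1, 2, 2, 2])
pa, miou, mf1 = segmentation_metrics(pred, gt, 3)
```

On this toy example PA is 5/6 while mIoU is only ≈0.72, showing how the macro metrics penalise the single misclassified pixel of a small class far more than PA does.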
3.1.2. Computational Complexity
Since the models are intended for operation on drones or edge-computing devices, their computational cost was also evaluated:
- Number of learnable parameters (M): the total number of trainable weights, in millions. It indicates how much memory the model requires and how prone it is to overfitting.
- Floating-point operations (FLOPs): the number of operations required for one forward pass, in gigaflops (GFLOPs). Fewer FLOPs mean faster processing and lower energy consumption.
- Inference time per image (ms): the average time needed to process an image at a defined resolution, measured on the same hardware. It gives a realistic indication of real-time usability.
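As a back-of-the-envelope illustration of why the MobileNetV2 encoder keeps the parameter count low, the following sketch compares a standard 3 × 3 convolution with its depthwise separable counterpart; the channel sizes are illustrative and not taken from Table 10.

```python
# Why depthwise separable convolutions shrink the parameter budget: a standard
# 3x3 conv with C_in = C_out = 256 versus its separable equivalent.
# The channel sizes are illustrative, not actual MobileNetV2 layer shapes.


def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out          # weights + biases


def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in + c_in              # one k x k filter per channel
    pointwise = c_in * c_out + c_out             # 1x1 channel-mixing projection
    return depthwise + pointwise


std = standard_conv_params(3, 256, 256)          # 590,080 parameters
sep = separable_conv_params(3, 256, 256)         # 68,352 parameters
```

The separable form needs roughly 8–9× fewer weights per layer, which compounds across the encoder into the order-of-magnitude gap between the proposed model and VGG- or ResNet-based baselines.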
Together, segmentation accuracy and computational complexity show which models identify defects most precisely and which ones best balance accuracy and speed. Monitoring both is essential to assess whether deep learning models can be deployed in operational PV monitoring systems.
3.2. Evaluation Metric Results
To supervise the model training and identify the best objective function for the proposed architecture, several loss functions were examined and compared (
Section 2.6).
The results (
Table 7) show that the combination of Weighted Cross-Entropy + Dice Loss achieved the best overall performance, with the highest mean IoU and F1-score values and the most stable training; this configuration was therefore adopted as the final loss function in our Efficient Attentive U-Net.
The performance of the proposed Efficient Attentive U-Net was benchmarked against a set of widely used segmentation architectures, including U-Net, U-Net++, Attention U-Net, DeepLabv3+, and SegFormer-B0.
The objective is to evaluate whether the proposed model achieves a superior trade-off between segmentation accuracy and computational efficiency.
To ensure methodological consistency, all models were trained and validated using the same pre-processing pipeline (
Section 2.3.3) and the 5 × 2 nested cross-validation scheme (
Section 2.5). Considering the structural differences between the architectures, each model was tuned to its own optimal hyperparameter configuration, rather than being constrained to a uniform setting. This strategy guarantees that the comparison reflects the best achievable performance of each network.
The adopted configurations are reported in
Table 8.
The hyperparameters for each baseline architecture were selected through an internal optimization process. For each model, a set of simulations was performed by varying the learning rate, optimizer, weight decay, and number of epochs within reasonable ranges, retaining the configuration that achieved the best performance in cross-validation. This model-specific tuning allows each architecture to operate in its most effective regime while maintaining fairness, as all networks share the same pre-processing pipeline, loss formulation, early stopping criteria, and 5 × 2 nested cross-validation protocol. Segmentation metrics are presented alongside computational complexity indicators (number of parameters, FLOPs, and inference time), allowing for a balanced assessment of accuracy and computational cost.
The evaluation results, averaged across the outer folds of the nested CV and reported as mean ± standard deviation, are presented in
Table 9. Metrics include PA, mIoU, and macro F1-score, as well as computational indicators: number of parameters, FLOPs, and inference time per image.
The results confirm that all metrics are consistently high across all models, while the proposed model demonstrates its superiority in the metrics most sensitive to class imbalance, such as mIoU and F1-score.
Table 10 defines the computational complexity analysis for all the evaluated models.
In terms of computational cost, the proposed network remains extremely lightweight, requiring only 5.24 M parameters and 10.27 GFLOPs per inference, with an average inference time of 118.64 ms per image. Compared to U-Net++ and Attention U-Net, the model reduces computational requirements by nearly an order of magnitude while still delivering better accuracy. SegFormer-B0 achieves competitive efficiency (3.72 M parameters, 12.64 GFLOPs, 241.53 ms) but does not match the accuracy of the proposed model.
Therefore, carefully optimised CNN-based architectures for PV thermography can still outperform transformer-based solutions on small domain-specific datasets.
The Efficient Attentive U-Net achieves the best trade-off between accuracy and efficiency, demonstrating its potential for real-time implementation in UAV-based PV inspection and integrated monitoring platforms.
3.3. Ablation Study
The ablation study on the Efficient Attentive U-Net evaluates the contribution of each architectural component and the effect of input resolution. The analysis investigated the impact of (i) the ASPP bottleneck, (ii) AGs, (iii) SE blocks, and (iv) the input image size (224 × 224, 256 × 256, and 512 × 512), described in
Table 12. All variants were trained and evaluated using the same nested 5 × 2 cross-validation strategy described in
Section 2.5. Among the tested input resolutions (224 × 224, 256 × 256, and 512 × 512), the configuration with 256 × 256 pixels achieved the best balance between segmentation accuracy and computational efficiency. This setup is therefore referred to as the proposed configuration in the following.
The results, shown in
Table 12, provide important information:
- Starting from the MobileNetV2 U-Net baseline, adding the ASPP module improved mIoU by +2.6 points and F1-score by +1.8, confirming the benefit of multi-scale context aggregation.
- The introduction of AGs further increased mIoU (+1.7) and F1 (+1.2), demonstrating that spatial attention effectively suppresses background noise and emphasises defective regions.
- The addition of SE blocks achieved the best performance at 224 × 224, with mIoU = 78.11% and F1 = 87.53%, while maintaining a lightweight complexity of 5.02 million parameters and 10.28 GFLOPs.
- Increasing the input size to 256 × 256 improved the segmentation of small or elongated anomalies (mIoU +0.8, F1 +0.6 compared to 224 px) with only a modest increase in FLOPs (+25%).
- Further enlarging the input to 512 × 512 yielded only a marginal gain (+0.35 mIoU, +0.27 F1 compared to 256 px) while more than tripling the FLOPs (from 12.46 to 40.37 GFLOPs), highlighting that higher resolution provides diminishing returns in accuracy while severely impacting computational efficiency.
Overall, the ablation confirms that each architectural addition (ASPP, AGs, and SE blocks) contributes positively to accuracy, and that 256 × 256 pixels represents the best trade-off between performance and computational cost for UAV-based PV inspection.
Although higher resolutions provided slightly better pixel-wise accuracy, they also increased training time and GPU memory usage by more than 40%. The 256 × 256 configuration was therefore selected as the final setup for all subsequent experiments to ensure an optimal balance between accuracy and efficiency.
3.4. Qualitative Results
In addition to quantitative metrics, qualitative examples were analyzed to further assess the ability of the proposed architecture to segment different types of defects in PV modules. Representative results are reported in
Figure 7, which shows side-by-side comparisons of the original thermographic images, the corresponding ground-truth annotations, and the predicted masks produced by the Efficient Attentive U-Net.
The visual results confirm the quantitative findings and highlight several aspects of the model’s behaviour. In particular, the network demonstrated high sensitivity to subtle temperature variations, accurately identifying small and intense hotspots, often only a few pixels in size, which are among the earliest indicators of soldering degradation or partial interconnection failure [
59,
60].
This behaviour aligns with recent studies showing that early-stage hotspot localization is crucial for preventive maintenance and accurate power-loss estimation in PV systems. The Attention Gates helped suppress background noise, enabling precise localization of these hotspots. Extended thermal patterns, such as those produced by bypass diode and substring faults, were also delineated accurately, with clear edges and few errors.
The ASPP module was important here because it captured information at different scales, needed to spot these long defects. Similar results were reported in [
61], where multi-scale context aggregation through ASPP improved the robustness of segmentation for elongated and diffuse thermal defects in industrial thermography. This confirms that multi-dilation receptive fields are essential for detecting both macro and micro anomalies under varying irradiance conditions. Even with irregular shapes and low contrast, the system could outline shaded areas, aided by SE blocks that sharpened the key features.
The Squeeze-and-Excitation mechanism also improved the discrimination between low-intensity shading and true hotspots, enhancing class separability and reducing false positives, consistent with observations in recent edge-guided segmentation models for PV thermograms [
62]. Glass damage remained the most challenging class, but the model outperformed basic U-Net variants in capturing its fragmented, irregular patterns. This improvement shows the value of the attention and channel recalibration mechanisms. Thanks to its compact architecture (5.24 M parameters, 10.27 GFLOPs), the model enables real-time inference on embedded GPUs and UAV platforms for on-site PV inspection.
To better understand how the network made its choices, Grad-CAM was used to view attention maps.
Figure 7 shows examples of these maps, with brighter areas showing where the model focused when it predicted defects [
63]. These views show that the model focused on areas with unusual heat and ignored background noise.
This visual evidence is coherent with interpretability analyses conducted in explainable AI models [
64,
65], confirming that gradient-based attention visualization provides a reliable means to validate the physical coherence of thermal-based defect detection. Such approaches improve confidence in AI-driven monitoring by ensuring that feature activation corresponds to real thermophysical phenomena rather than artefacts or background biases.
3.5. Discussion
The thermographic campaign was intentionally conducted under inspection conditions compliant with IEC TS 62446-3 (irradiance ≥ 600 W/m2, low cloud cover, moderate wind and humidity). As a result, critical scenarios such as heavy cloud cover, high humidity, or intense specular reflections are not represented in the current dataset. Under such conditions, several failure modes may occur: reduction in thermal contrast between healthy and defective cells, appearance of spurious hotspots due to reflections, and partial concealment of real anomalies caused by dense dirt or thin films of water. These issues, similar to those reported in other industrial applications based on thermography, represent an important limitation of the study. Future datasets will include more adverse environmental scenarios to evaluate and mitigate these failure modes.
The analyses conducted confirm that Efficient Attentive U-Net achieves an optimal balance between segmentation quality and processing speed. Compared to standard versions of U-Net and more complex models such as DeepLabv3+ and SegFormer-B0, the proposed network performed better in terms of average IoU and F1, while maintaining a reduced number of parameters and FLOPs. This result supports the central hypothesis of the work: lightweight encoder–decoder architectures, appropriately enriched with attention and multi-scale modules, can outperform heavier models in specialized tasks such as IR thermography for PVs.
The superior performance of Efficient Attentive U-Net on this small dataset stems from the combination of a lightweight but expressive encoder and targeted attention mechanisms. MobileNetV2 reduces overfitting while preserving the ability to extract discriminative thermal patterns; SE blocks and Attention Gates improve the ability to highlight subtle anomalies and attenuate background noise; the ASPP module enriches the multi-scale context, improving the joint segmentation of point hotspots and extended substring-shaped defects. In contrast, heavier backbones such as ResNet-50 (DeepLabv3+) or transformer architectures require much larger and more varied datasets to fully express their representational capacity, making them more prone to overfitting and less efficient in the accuracy–efficiency trade-off in the current context. The combined use of Attention Gates and SE blocks also facilitates the identification of small defects, while the ASPP bottleneck improves the detection of elongated defects such as those of substrings. The use of 256 × 256 images represents a good compromise between quality and computational cost, offering performance comparable to 512 × 512 but with shorter processing times. To ensure robustness, a cross-validation protocol without leakage was adopted, in which each image is tested only once and hyperparameter optimization is performed exclusively on the training sets. This approach is crucial in PV thermography, where limited datasets can lead to misleading accuracy estimates if rigorous validation methodologies are not used.
Despite the results obtained, the dataset remains small (100 original images, increased to 400 through data augmentation), potentially reducing the transferability of the model to larger plants or different environmental conditions. This limitation can be mitigated by merging thermographic UAV data, satellite images, and field sensors. Furthermore, the adoption of a federated learning system would allow training to be distributed across multiple sites without the need to centralize data. Similar strategies, already established in the biomedical and sensor fusion fields, show how data diversity improves model robustness and reduces overfitting [
66,
67]. Although the proposed framework has shown high performance, some limitations must be acknowledged. All thermographic images were acquired at a single PV site over two months under Mediterranean climatic conditions. This ensures experimental consistency and fair comparison between architectures, but limits the climatic, technological, and plant variety of the dataset. Small, domain-specific datasets are inherently more sensitive to domain shift, especially in thermography, where irradiance, viewing angle, ventilation, and ambient temperature strongly influence the thermal appearance of defects. Therefore, the reported performance should be interpreted as representative of the case study, and not as a universal generalization. Generalization will require multi-site and multi-season validations, already planned for future work, including modules from different brands and heterogeneous mounting configurations. Some types of defects, such as glass damage, remain difficult to detect because they are rare and characterized by high geometric variability. As is typical in real PV datasets, categories such as glass breakage or bypass diode failures are naturally underrepresented. The training strategy based on class weights, Dice Loss, and anti-leakage augmentation allowed us to maintain stable performance even on these minority classes, where the model outperformed all baselines. Additional tests with focal loss and moderate oversampling showed improvements in some folds without, however, consistently exceeding the optimal configuration. Future work will explore the generation of synthetic thermograms based on physical models and targeted acquisition campaigns to strengthen the segmentation of rare classes.
Finally, for real-time use, processing times of less than 120 ms per image must be guaranteed on low-computing-power devices. Efficient Attentive U-Net can be integrated into UAV inspection systems and PV monitoring platforms, offering high accuracy and the ability to detect small anomalies with a reduced computational load. With a larger and more diverse dataset, it will be possible to explore hybrid CNN–Transformer models and further optimizations for execution on specific hardware.
4. Conclusions
This paper introduces an Efficient Attentive U-Net for semantic segmentation of infrared thermographic images in PV modules. The design balances segmentation accuracy and computational speed, suitable for UAV inspections and edge computing. The network uses a MobileNetV2 encoder, AGs, SE blocks, and an ASPP bottleneck. Tests showed it performed better than U-Net, U-Net++, Attention U-Net, DeepLabv3+, and SegFormer-B0. A nested 5 × 2 cross-validation process confirmed the model’s balance in tests: 96.9% Pixel Accuracy, 78.0% mean IoU, and 87.5% macro F1-score, using 5.0 M parameters and about 10 GFLOPs per inference. The results suggest that optimized lightweight CNN designs can beat bigger networks, even with the imbalanced, small datasets common in PV thermography. An ablation study showed how each part and the input resolution helped. Using 256 × 256-pixel inputs gave the best mix of accuracy and speed. Higher resolutions (like 512 × 512) only slightly improved accuracy but increased complexity. Class-wise tests showed good performance for most defect types, but rare issues like glass damage were still hard to identify. Some limitations exist: the dataset was small and specific to one site, which could limit how widely the results apply. Also, while the model had good inference speeds on GPUs, using it in real time on embedded systems will need more work. Further optimization can be achieved through pruning, quantization, and neural architecture search (NAS) methods to enable deployment on ultra-low-power IoT processors. This approach aligns with the current trend toward smart edge systems in healthcare and renewable energy monitoring, where lightweight and energy-efficient AI architectures are mandatory for autonomous long-term operation. Although the dataset was acquired from a single PV site over two months (April–June 2025), it encompasses a wide range of irradiance (750–1000 W/m2) and ambient conditions.
Additional campaigns at multiple sites and seasons are planned to assess model generalization and robustness to environmental variability. The public release of the dataset and code will further support reproducibility within the Energies community.
Future studies will expand the dataset with images from different sites and seasons to improve generalization. They will also explore architectural improvements, such as hybrid CNN–Transformer models and attention mechanisms for rare classes, together with model compression and quantization for real-time deployment on UAVs and edge devices.
The energy loss intervals introduced in this work should be interpreted as first-order indicators, based on intervals, and founded on ΔT severity criteria and thermo-electrical relationships, rather than as precise electrical predictions. Since thermal data alone cannot replace a complete electrical characterization, future work will include systematic acquisitions of I–V curves, on-site checks following maintenance interventions, and the integration of an economic performance analysis to quantify the operational benefit of energy-oriented diagnostics.
The Efficient Attentive U-Net shows that carefully designed lightweight deep learning setups can accurately and efficiently diagnose PV faults. This approach can aid predictive maintenance and smart PV energy systems.