1. Introduction
Climate change is one of the most urgent challenges of the 21st century [
1]. Driven largely by the relentless combustion of fossil fuels, the surge in greenhouse gas emissions has intensified the global push for cleaner, more sustainable energy alternatives [
2]. Amid this transition, renewable energy sources have experienced unprecedented growth, with solar photovoltaic (PV) technologies at the forefront, offering a scalable and increasingly cost-effective pathway to decarbonization of the energy sector [
3].
Photovoltaic solar energy has exhibited exponential growth in recent years, establishing itself as one of the most promising renewable technologies for addressing the climate crisis while meeting the rising global energy demand [
4]. Nonetheless, one of the principal challenges associated with photovoltaic technology lies in achieving optimal conversion efficiency of photovoltaic modules, which is markedly influenced by various environmental factors, among which temperature is especially critical [
5]. Under real operating conditions, photovoltaic modules typically convert only 15% to 20% of incident solar radiation into electrical energy, while the majority is converted into heat, thereby increasing the module’s temperature [
6,
7,
8].
Previous research has explored the influence of wind flow and temperature distribution on photovoltaic modules. Authors such as [
9,
10,
11] affirm that airflow around a photovoltaic system plays a pivotal role in its overall performance by affecting thermal dissipation of the modules. Thermal losses in photovoltaic modules, particularly those based on polycrystalline and monocrystalline technologies, have been extensively analyzed in prior studies and represent a key factor directly impacting system efficiency and energy yield [
12,
13].
Manufacturers provide thermal coefficients based on performance evaluations across various temperature ranges. Commercial crystalline silicon (c-Si) cells generally experience an efficiency loss of approximately 0.45% per degree Celsius increase in temperature, while amorphous silicon (a-Si) cells exhibit lower thermal sensitivity, with efficiency losses of around 0.25% [
14]. This temperature-induced performance degradation is particularly significant in tropical regions such as Colombia, where elevated ambient temperatures and intense solar irradiance can lead to substantially high operating temperatures in the modules.
Despite significant technological advances in photovoltaic systems, effective thermal management remains a major obstacle to maximizing energy conversion efficiency and extending the operational lifespan of these systems. Recent studies indicate that the operating temperature of a solar cell could rise between 20% and 40% above ambient levels, resulting in power losses of up to 25% [
15,
16]. Furthermore, elevated temperatures not only reduce immediate power output but also accelerate aging mechanisms, potentially shortening the life of the system by up to 10% for every 10°C of sustained operation above standard test conditions [
17].
The development of advanced cell architectures, particularly Passivated Emitter and Rear Cell (PERC) technology introduced commercially around 2015, has marked a significant advancement. PERC panels have become the industry standard due to several key advantages, most notably higher conversion efficiencies, with commercial modules typically achieving 19–22% efficiency compared to 15–18% for conventional technologies [
18,
19]. However, even these improved modules remain vulnerable to performance degradation under high-temperature conditions, especially in tropical climates where elevated ambient temperatures and intense solar irradiance create challenging operating environments.
In this context, the application of machine learning and artificial intelligence techniques in photovoltaic performance prediction has gained significant momentum in recent years, with researchers exploring algorithms such as neural networks, support vector machines, and ensemble methods to forecast solar panel output under different environmental conditions [
20,
21,
22,
23,
24]. While recent studies such as that of Asiedu et al. [
25] have shown the effectiveness of artificial intelligence in predicting PV output, their regression visualizations exhibit certain limitations, including skewed data distributions, absence of uncertainty representation, and a lack of distinction between training and test datasets. These limitations reduce the interpretability and generalizability of the model predictions under high-temperature conditions. Other researchers have explored thermal modeling of PV modules [
6] and investigated performance under real conditions [
8], yet most studies apply monolithic modeling approaches that fail to capture the non-linear and regime-dependent nature of thermal losses, particularly under high-irradiance, high-temperature conditions where predictive accuracy is most critical.
To address these research gaps, this study proposes a novel thermal clustering methodology for predicting thermal losses in MonoPERC solar modules under real outdoor conditions. This work investigates thermal loss mechanisms using machine learning models to assess how module operating temperature affects energy performance, while evaluating model robustness across multiple data-splitting scenarios. The main contributions include (1) a K-means-based clustering approach that partitions operational data into three distinct temperature regimes (low: 10–25 °C, medium: 25–40 °C, and high: 40–53 °C) to account for non-linear thermal behavior; (2) a comprehensive evaluation of seven machine learning algorithms under both baseline and cluster-enhanced frameworks; (3) implementation and validation using high-resolution experimental data collected from a PV system at the RADIANT laboratory of Fundación Universitaria Los Libertadores in Bogotá, Colombia (4.65° N, 74.07° W, 2640 m above sea level), providing relevant insights for high-altitude tropical urban climates; and (4) demonstration of how regime-specific modeling significantly improves prediction accuracy, enabling better energy yield forecasting and system optimization.
This paper is organized as follows:
Section 2 describes the experimental setup, data collection methodology, and evaluation of seven machine learning algorithms, including a novel thermal clustering approach.
Section 3 presents the baseline performance results, thermal clustering improvements, and comprehensive visualization analysis of prediction accuracy across all algorithmsm, and discusses the practical implications and significance of the findings for photovoltaic system optimization in tropical climates. Finally,
Section 4 summarizes the key conclusions and outlines future research directions for enhanced thermal loss prediction in PERC technology.
2. Materials and Methods
2.1. Experimental Setup and Data Source
A photovoltaic (PV) system was constructed using a distributed network topology of solar panels in a modular form and Hoymiles microinverters. Unlike traditional systems, where a single inverter receives the energy generated by the entire network of photovoltaic panels, each microinverter receives energy from four photovoltaic panels as shown in
Figure 1. This study’s solar power system is built with twelve 550 W p-type monocrystalline PERC panels from Yingli Solar (Baoding, China). These panels are connected to three 2 kWp microinverters, with four panels managed by each inverter. The microinverters are linked together in a cascade configuration, a design that allows the system’s energy capacity to be easily modified in the future.
In order to measure the temperature of the solar panels, a network of DS18B20 sensors has been added, controlled by a second ESP32 microcontroller. The wiring scheme for these sensors is detailed in
Figure 1. These specific sensors were chosen because their 1-Wire interface allows multiple sensors to be connected on a single communication line, which helps ensure signal quality and noise immunity. They are also well-suited for this application as they can operate in harsh, moisture-rich environments, function over long distances, and operate within a wide temperature range of −55 °C to 125 °C.
A network of 12 P-type PERC solar panels with a maximum output power of 550 W was installed as shown in
Figure 2, and the technical specifications provided by the manufacturer are detailed in
Table 1, which includes electrical parameters such as nominal voltage (42 V) and high conversion efficiency (21.29%).
The experimental setup was located in Bogota (4.65174° N, 74.06630° W) and configured with a 5° tilt angle and a south-facing orientation (180° azimuth) to maximize solar radiation capture. This arrangement enabled the comprehensive collection of data on the system’s performance, focusing on how temperature and energy yield patterns varied under Bogota’s specific high-altitude tropical climate and diverse weather conditions.
2.2. Error Measurement Analysis
The experimental setup described in previous sections includes the following equipment:
DS18B20 digital temperature sensor:
- -
Measurement range: −55 °C to 125 °C;
- -
Resolution: 12 bits, equivalent to 0.0625 °C;
- -
Accuracy: ±0.5 °C (−10 °C to 85 °C normal operating range);
- -
Technology: semiconductor junction (bandgap principle), with an internal circuit integrating an Analog to Digital Converter (ADC).
RS485 Solar Radiation Sensor:
- -
Measurement range: 0 to 2000 W/m2;
- -
Accuracy: ±5% of reading;
- -
Technology: Light sensor assembly using a silicon photodiode and a cosine corrector, with a spectral response of 300 to 1100 nm;
- -
Resolution: 1 W/m2, non linearity <±2%;
- -
Temperature coefficient: <±0.15%/°C.
Hoymiles HMS20004TA power measurement system
- -
Capacity: 4 × 670 W = 2680 W per unit;
- -
Maximum voltage and current: 60 V and 4 × 16 A;
- -
Accuracy: ±2%;
- -
Resolution: 0.1 W.
The temperature range is 10 to 53 °C, the solar radiation range is 0 to 1449 W/m2, and the maximum power output of the solar panel is 550 W. All sensors are calibrated, and each resolution is considered sufficient.
2.3. Data Collection and Preprocessing
The initial dataset underwent a rigorous five-stage cleaning process to ensure its quality. This included removing missing values, standardizing timestamps, synchronizing data points within a five minutes tolerance, detecting outliers using percentile-based methods, and validating overall data quality. This process successfully retained 85 to 95% of the original records, resulting in a reliable dataset of 2844 daily samples collected every 15 s over a seven-day period.
Using a temporal split strategy (60% for training, 20% for validation, and 20% for testing), we trained seven machine learning models: Random Forest, k-NN, MLP, Linear Regression, Ridge Regression, XGBoost, and an optimized SVM. To prevent data leakage, feature scaling was applied only to the training data for algorithms that required normalized inputs. Performance was evaluated using normalized R2, MAE, and RMSE metrics to enable cross-study comparisons.
2.4. Machine Learning Algorithms
The selection of machine learning algorithms is crucial for addressing the thermal loss prediction problem in photovoltaic modules, which exhibits nonlinear characteristics and high dependence on multiple environmental variables. This study evaluated seven representative algorithms from different learning paradigms, ranging from simple linear approaches to complex ensemble methods, aiming to identify the most suitable techniques for modeling the complex thermal dynamics of MonoPERC modules.The following describes the theoretical foundations and specific characteristics of each of these seven algorithms implemented in our experimental framework.
Linear Regression is a simple and widely used supervised machine learning algorithm that models the relationship between input variables (features) and a continuous output (target) by fitting a straight line. It minimizes the difference between predicted and actual values using a loss function, typically Mean Squared Error (MSE). Despite its simplicity, it serves as a foundation for more complex models and works well when the relationship between variables is linear [
26].
Ridge regression represents a regularized linear modeling approach that employs L2 penalty terms to address fundamental challenges in statistical learning. This technique mitigates overfitting phenomena by constraining coefficient magnitudes, while simultaneously resolving multicollinearity issues that arise when predictor variables exhibit high intercorrelation. Ridge regression specifically corrects for multicollinearity in regression analysis [
27].
Linear SVR is a supervised machine learning algorithm used for predicting continuous values based on a linear relationship between input features and the target variable. It is a variant of Support Vector Machines (SVM) adapted for regression tasks rather than classification. Unlike traditional linear regression, which minimizes the mean squared error, Linear SVR introduces an epsilon-insensitive loss function, meaning it ignores errors that fall within a certain epsilon range around the actual target values. Only deviations greater than this margin are penalized, which makes the model more robust to small fluctuations or noise in the data [
28].
The Multilayer Perceptron (MLP) is an artificial neural network composed of multiple interconnected layers of neurons. Unlike the simple perceptron, the MLP can solve non-linear problems thanks to its hidden layers and non-linear activation functions. Its basic architecture consists of an input layer that receives the data, one or more hidden layers that process the information, and an output layer that produces the result [
29].
The Random Forest Regressor is a powerful machine learning algorithm that combines multiple decision trees to create a more accurate and stable prediction model. It operates by constructing numerous decision trees during training and outputting the mean prediction of the individual trees for regression tasks. Each tree is built from a bootstrap sample of the training data, and a random subset of features is considered when splitting nodes, introducing randomness that helps prevent overfitting [
30].
XGBoost, which stands for eXtreme Gradient Boosting, is a highly efficient and scalable implementation of the gradient boosting framework. It was developed by Tianqi Chen and has become widely popular in data science and machine learning due to its performance and flexibility. XGBoost builds models in a sequential manner, where each new model corrects the errors made by the previous ones, making it particularly effective for structured data [
31].
K-Nearest Neighbors Regression (KNN Regression) is a non-parametric, instance-based learning algorithm. It predicts the output value for a new input by finding the K closest data points (neighbors) in the training dataset and averaging their output values [
32].
2.5. Thermal Clustering Methodology
The thermal clustering approach addresses the non-linear relationship between operating temperature and power losses by partitioning the dataset based on thermal characteristics before model training. This methodology is motivated by the observation that thermal loss patterns differ significantly between low-temperature periods (morning/evening operations) and high-temperature periods (midday operations) in Bogotá’s climate conditions.
The algorithm employs seven key thermal characteristics to define the feature space for clustering:
Actual cell temperature: ;
Temperature deviation from STC: − 25 °C;
Solar irradiation: G [W/m2];
Quadratic temperature terms: and ;
Temperature-irradiation interaction: ;
Thermal efficiency: ;
Temperature gradient: .
K-means Clustering Implementation:
Feature normalization using StandardScaler:
Optimal cluster number determination using silhouette score:
where
is the mean intra-cluster distance and
is the mean nearest-cluster distance.
K-means clustering optimization:
where
represents cluster
i and
is the centroid of cluster
i.
The clustering process identified distinct thermal regimes:
Low-temperature regime: 10–25 °C (morning/evening);
Medium-temperature regime: 25–40 °C (transition periods);
High-temperature regime: 40–53 °C (midday operations).
Preliminary analysis of the thermal loss dataset revealed that photovoltaic modules exhibit distinct operational regimes under varying temperature conditions. To address the non-linear relationship between operating temperature and power losses, this study implements a thermal clustering methodology that partitions the dataset based on thermal characteristics before model training.
The clustering approach is motivated by the observation that thermal loss patterns differ significantly between low-temperature periods (morning/evening operations) and high-temperature periods (midday operations) in Bogotá’s climate conditions. Rather than training a single model across the entire temperature range (10 °C to 53 °C), the methodology identifies distinct thermal regimes and trains specialized models for each operational condition.
This approach enables algorithms to capture regime-specific thermal dynamics while maintaining computational efficiency for real-time monitoring applications. The following mathematical framework describes the implementation of the thermal clustering methodology [
33].
2.6. Thermal Loss Calculation and Metrics
The thermal efficiency metric is calculated as the ratio between measured and theoretical power output at 25 °C:
where the theoretical power at 25 °C is given by
with
W/m
2 and
W for the PERC panels analyzed.
Temperature Coefficient Model
The thermal loss patterns follow the standard temperature coefficient equation:
where:
The actual thermal losses are computed as the difference between temperature-corrected expected power and measured power:
where:
Evaluation of the thermal loss prediction models requires comprehensive metrics that capture different aspects of model performance. To ensure robust and comparable assessment across all machine learning algorithms, this study employs a standardized set of evaluation metrics that provide both normalized and interpretable measures of prediction accuracy.
Table 2 presents the mathematical formulations and descriptions of the five key metrics used to evaluate thermal loss prediction performance: normalized Mean Absolute Error (MAE), normalized Root Mean Square Error (RMSE), coefficient of determination (R
2), Pearson correlation coefficient, and raw MAE in watts for practical interpretation. These metrics collectively provide a comprehensive framework for assessing model accuracy, precision, and practical applicability in photovoltaic thermal loss prediction scenarios.
where:
n = number of observations;
= actual thermal loss value for observation i (W);
= predicted thermal loss value for observation i (W);
= mean of actual values;
= mean of predicted values;
= population standard deviation of actual values;
.
2.7. Solar Irradiance and Temperature Profiles
Figure 3 illustrates the solar irradiance measured on 16 April 2025, which reached a maximum value of 1449 W/m
2 at 11:13 h. The profile follows the expected diurnal pattern, with irradiance increasing rapidly during morning hours, reaching peak values around midday, and gradually decreasing in the afternoon. Notable fluctuations in the irradiance curve, particularly during peak hours, likely indicate passing cloud cover affecting direct solar exposure. The total solar energy received throughout the day was 7.21 kWh/m
2, with an average irradiance of 531 W/m
2. These measurements provide crucial input data for modeling photovoltaic system performance, as they represent the available solar resource that drives energy conversion in the PV panels.
The solar resource measurements were conducted using an RS485 Solar Radiation Sensor a silicon photodiode-based pyranometer designed specifically for PV system monitoring applications. This instrument features a dome-shaped diffuser that provides precise cosine correction, ensuring accurate irradiance readings across all solar elevation angles throughout the day. With a measurement range of 0–2000 W/m2, the sensor adequately captured the full spectrum of irradiance conditions at the Bogotá installation site.
Figure 4 presents the hourly temperature profiles of photovoltaic panel surfaces over a one-day monitoring period. The data reveals substantial daily fluctuations in temperature. The thermal patterns follow the solar irradiance cycle throughout all monitored days, showing rapid morning heating from approximately 10 °C, peak temperatures reaching 48–54 °C during midday hours (10:00–14:00), and a gradual afternoon cooling. The temperature ranges varied by day, with Saturday showing the highest peak temperature of approximately 54 °C, while Thursday and Friday exhibited more moderate profiles with peaks around 35 °C.
2.8. DC Power Output Analysis of a Single PERC Module
Figure 5 presents the DC power output profile of a single 550 W p-type PERC monocrystalline solar panel monitored over a typical day. The data captures the power generation pattern throughout daylight hours (approximately 6:00 to 18:00), with peak power outputs reaching approximately 550 W during optimal irradiance conditions around midday (10:00–14:00 h). The monitoring period reveals a characteristic diurnal curve with significant fluctuations attributed to cloud cover and atmospheric conditions typical of tropical highland regions. The profile shows a gradual power increase during morning hours (6:00–10:00), followed by highly variable output during peak solar hours with pronounced oscillations between 150 W and 550 W, indicating intermittent cloud cover. Notable short-duration drops in power generation are observed throughout the midday period, demonstrating the dynamic nature of solar irradiance under variable weather conditions. The power output gradually decreases during afternoon hours (14:00–18:00) until ceasing at sunset.
3. Results
3.1. Baseline Performance Evaluation of Machine Learning Algorithms
The machine learning algorithms were evaluated for their ability to predict thermal loss, and the results showed significant variations in performance. K-Nearest Neighbors (KNN) emerged as the top-performing model, achieving a superior correlation of 0.9612 and low normalized errors (NMAE = 0.0967, NRMSE = 0.2776). This translates to a mean prediction error of only 7.3 W, or 1.3% of the panel’s rated power. This is shown in
Table 3.
Ensemble methods, specifically XGBoost (NMAE = 0.1452) and Random Forest (NMAE = 0.1469), also performed exceptionally well, falling into the top tier of accuracy. The Multi-Layer Perceptron (MLP) demonstrated good potential with an NMAE of 0.1573, while Support Vector Regression (SVR) showed moderate accuracy (NMAE = 0.1832).
In contrast, linear methods such as Linear Regression and Ridge Regression were found to be inadequate for this task, with NMAE values exceeding 0.29. This confirms that linear models are not suitable for capturing the complex thermal dynamics of the system.
The study establishes three distinct performance tiers and provides a comprehensive baseline for various algorithmic categories, which will be used for future evaluation of clustering enhancements.
To address reviewer concerns regarding statistical validity, all performance indicators are reported as the Mean (
) ± Standard Deviation (
) obtained from the cross-validation process, thereby confirming the robustness of the results. This is shown in
Table 4. The evaluation specifically emphasizes the Normalized Mean Absolute Error (NMAE), as it is the most appropriate metric for statistical comparison. NMAE provides a scale-independent measure of error, which is crucial for assessing model generalizability and transferability across different system capacities or power ratings.
3.2. Performance with Thermal Clustering
Table 5 demonstrates significant performance improvements through the thermal clustering methodology, with K-means identification of three distinct thermal regimes that allow specialized model training. Multi-Layer Perceptron emerged as the top performer with correlation of 0.9561 and NMAE = 0.1409, followed closely by K-Nearest Neighbors (correlation = 0.9584, NMAE = 0.1032) and XGBoost (correlation = 0.9525, NMAE = 0.1549). Notably, Support Vector Regression showed substantial improvement with NMAE reducing from 0.1832 to 0.1725, while Random Forest achieved NMAE = 0.1769, representing meaningful enhancement over baseline performance. Linear methods demonstrated moderate improvements, with Linear Regression NMAE improving from 0.2898 to 0.2825 and Ridge from 0.2901 to 0.2835. All algorithms showed improved correlations exceeding 0.90, with the best-performing models maintaining normalized errors below 0.16 standard deviations. The consistent improvement across all algorithms validates thermal clustering as an effective method for boosting the accuracy of photovoltaic thermal loss predictions.
Table 6 presents a focused robustness analysis for the Thermal Clustering Approach, detailing the Standard Deviation (
) and the 95% Confidence Interval (CI 95%) for the NMAE metric across all algorithms. The
values observed for the majority of models under this approach are consistently low (ranging between 0.0359 and 0.0484). This low variability is a critical indicator that the data partitioning results in high consistency and low variance in predictive performance across different cross-validation folds. This finding strongly supports the notion that the improved accuracy is structurally sound and not attributable to random data splits. Notably, the KNN Clustering model, despite a slightly higher
(
), maintains the highest upper bound on its 95% CI (
). This wide, yet high-performing interval confirms that, with 95% confidence, the model’s true predictive capability consistently remains superior to that of the other tested algorithms, validating the robustness of the combined KNN-Clustering strategy.
3.3. Thermal Feature Correlation Analysis
Figure 6 presents the correlation matrix between thermal characteristics used in the clustering methodology. The analysis reveals strong interdependencies among thermal variables, with several key patterns emerging from the data.
Perfect Mathematical Relationships: The correlation matrix confirms perfect correlations (r = 1.00) between Tcell and , validating the mathematical transformation used throughout this study. This relationship serves as an internal consistency check for the thermal measurement system. High Thermal Coupling: Strong correlations are observed between G_irradiance and Tcell_squared (r = 0.98), as well as G_irradiance and temp_irrad_interaction (r = 0.97), demonstrating the significant coupling between solar irradiation and non-linear temperature effects in PERC modules. These relationships confirm that thermal losses are fundamentally driven by the interaction between irradiance and temperature rather than by these variables independently.
Thermal Variable Clustering: The primary thermal variables (Tcell, delta_T, G_irradiance, Tcell_squared) form a highly correlated cluster with correlation coefficients ranging from 0.76 to 1.00. This clustering pattern provides scientific justification for the thermal regime identification methodology, as these variables collectively capture the fundamental thermal state of the photovoltaic system.
Independent Information Sources: Notably, eta_thermal exhibits moderate correlations (0.26–0.34) with other thermal variables, indicating that thermal efficiency captures unique performance characteristics not fully explained by temperature and irradiance alone. Similarly, temp_gradient shows minimal correlations (0.04–0.11) across all variables, confirming its role as an independent temporal dynamics indicator.
These correlation patterns validate the thermal clustering approach by demonstrating distinct groupings of related variables, enabling the identification of thermal regimes that improve model performance across all machine learning algorithms. The analysis supports the hypothesis that thermal losses in PERC modules follow complex, non-linear relationships that benefit from regime-specific modeling approaches.
3.4. Training Correlation Analysis by ML Algorithm
The comprehensive evaluation of algorithm performance across different training data proportions provides crucial insights into model scalability and data efficiency for photovoltaic thermal loss prediction.
Figure 7 presents a systematic analysis of training correlations achieved by seven machine learning algorithms when training data varies from 60% to 85% of the total dataset.
3.5. Prediction Accuracy Visualization
Figure 8 presents scatter plots comparing actual versus predicted thermal losses for all seven machine learning algorithms evaluated in this study. These visualizations provide crucial insights into model performance characteristics and prediction patterns across the thermal loss range of 0–300 W observed in the MonoPERC modules. The data obtained with the temperature, solar irradiance and power sensors have an accuracy of
,
and
, considered sufficient.
The scatter plots reveal distinct performance patterns among algorithms. K-Nearest Neighbors (
Figure 8c) demonstrates exceptional accuracy with minimal scatter around the perfect prediction line, consistent with its superior quantitative metrics (correlation = 0.9612, NMAE = 0.0967). Multi-Layer Perceptron (
Figure 8f) exhibits excellent linearity and tightclustering, validating its strong performance in both baseline and clustered evaluations.
Ensemble methods XGBoost (
Figure 8g) and Random Forest (
Figure 8d) show strong predictive capability with consistent performance across the entire thermal loss range, though Random Forest exhibits slightly more scatter in the mid-range predictions (100–200 W). Support Vector Regression (
Figure 8e) displays good overall correlation but shows increased variance at higher thermal loss values, indicating potential challenges in extreme temperature conditions. Linear methods Ridge Regression (
Figure 8a) and Linear Regression (
Figure 8b) demonstrate the limitations of linear approaches for this application, with notable scatter and systematic deviations from the perfect prediction line, particularly at higher thermal loss values. This confirms the non-linear nature of thermal dynamics in PERC modules and justifies the superior performance of non-linear algorithms.
The visualization analysis supports the quantitative findings and provides practical insights for algorithm selection in real-world photovoltaic monitoring applications, where prediction accuracy across the full operational range is crucial for effective thermal loss management.
The results of this study provide significant insights into the application of machine learning algorithms for thermal loss prediction in MonoPERC solar modules under real operating conditions. The superior performance of K-Nearest Neighbors in baseline evaluations, achieving a correlation of 0.9612 and NMAE of 0.0967, demonstrates the effectiveness of instance-based learning for capturing complex thermal dynamics in photovoltaic systems. This finding aligns with previous research highlighting the capability of non-parametric algorithms to model non-linear relationships without making strong assumptions about the underlying data distribution.
The superior performance of the K-Nearest Neighbors (KNN) algorithm is fundamentally rooted in its ability to execute highly effective local regression, a characteristic that aligns perfectly with the physics and non-linear data structure of thermal losses in PV modules.
The thermal clustering methodology represents a novel contribution to photovoltaic performance modeling, addressing the fundamental challenge of non-linear temperature-power relationships across different operational regimes. The identification of distinct thermal regimes (10–25 °C, 25–40 °C, and 40–53 °C) and the subsequent training of specialized models for each regime resulted in consistent performance improvements across all algorithms. This approach is particularly relevant for tropical climates like Colombia, where extreme temperature variations significantly impact module performance.
The exceptional performance of ensemble methods, particularly XGBoost and Random Forest, validates their effectiveness in handling the complexity and variability inherent in real-world photovoltaic data. These algorithms’ ability to capture non-linear interactions between environmental variables and thermal losses makes them well-suited for practical deployment in monitoring and optimization systems.
The limitations of linear methods, evident in their lower correlation coefficients and higher error rates, confirm the inadequacy of simple linear models for capturing the complex thermal dynamics of PERC modules. However, the improvements observed with thermal clustering suggest that even linear approaches can benefit from regime-specific modeling strategies.
The practical implications of this research extend beyond academic interest, offering valuable insights for photovoltaic system operators and manufacturers. The ability to predict thermal losses with high accuracy enables proactive maintenance scheduling, performance optimization, and improved energy yield forecasting, particularly crucial for large-scale solar installations in tropical regions.
4. Conclusions
This study successfully demonstrates the effectiveness of the proposed thermal clustering methodology for improving the prediction of thermal losses in MonoPERC modules under tropical conditions. While the dataset used was limited to one week, it provided sufficient variability to validate the approach and prove its feasibility. With significant variations among seven algorithms, K-Nearest Neighbors achieved superior performance (correlation = 0.9612, NMAE = 0.0967, prediction errors = 7.3 W).
The novel thermal clustering methodology represents a significant contribution to photovoltaic performance modeling, consistently improving prediction accuracy across all algorithms. The results identify two superior approaches: the non-parametric K-Nearest Neighbors (KNN) model and the deep learning-based Multi-Layer Perceptron (MLP). KNN emerged as the overall top performer (Correlation = 0.9584, ), demonstrating the optimal effectiveness of local regression on highly homogeneous data. The MLP secured the position as the best deep learning architecture (Correlation = 0.9561, ), confirming the critical value of regime-specific modeling for accurately capturing complex, non-linear thermal dynamics.
Key findings include the validation of ensemble methods (XGBoost, Random Forest) as highly effective for photovoltaic thermal modeling, the confirmation that linear methods are inadequate for complex thermal dynamics but benefit from clustering approaches, and the demonstration that all algorithms benefit from increased training data, with performance improvements continuing up to 85% training ratios. As future work, extending the analysis to multi-seasonal and longer-term datasets will further enhance the generalizability of the results and strengthen their applicability across diverse climatic conditions.
This research provides practical implications for photovoltaic system optimization, including accurate thermal loss prediction for maintenance scheduling, improved energy yield forecasting capabilities, and enhanced understanding of PERC module behavior in tropical climates. Future research directions should focus on extending the methodology to other photovoltaic technologies, investigating long-term seasonal variations in thermal performance, and developing real-time implementation frameworks for commercial monitoring systems.