1. Introduction
The rapid deployment of Photovoltaic (PV) systems has created an urgent need for scalable, data-driven methods to optimize performance and maintenance strategies. Environmental soiling, the accumulation of dust, pollen, and other airborne particulates on module surfaces, is a primary cause of efficiency loss. It produces nonlinear, site-dependent reductions in electrical output that vary with climate, season, and operational conditions. Conventional maintenance approaches, such as fixed cleaning schedules or reactive inspections, are often inadequate for large-scale PV plants, as they fail to capture the dynamic nature of soiling.
To overcome these limitations, predictive maintenance frameworks have emerged that integrate high-resolution operational measurements with synthetic datasets. We distinguish between two types of synthetic data: physics-based simulations (derived from physical equations) and regression-based simulated data (generated by statistical models trained on real measurements). This study focuses exclusively on the latter. These frameworks allow the estimation of clean-condition power baselines, precise quantification of energy losses, and dynamic scheduling of cleaning interventions. When combined with statistical regression techniques, they provide robust modeling of PV performance despite multicollinearity, high-dimensional inputs, and measurement noise from environmental variables such as irradiance, module temperature, humidity, and wind speed.
Among regression approaches, Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression are particularly effective. PLSR and PCR reduce dimensionality through latent-variable projection. Ridge and Lasso employ L2 and L1 regularization to improve stability and sparsity; Elastic Net combines both penalties; and Robust Regression mitigates outlier effects caused by shading, inverter clipping, or sensor errors. By leveraging both real and simulated PV datasets, these models enable the generation of adaptive cleaning schedules that closely replicate actual performance degradation, reducing operational costs while maximizing energy yield.
This study builds on these advances by providing a comprehensive evaluation of multiple regression models for PV cleaning schedule prediction, establishing a rigorous framework for predictive maintenance in large-scale, data-constrained PV systems. The following sections review soiling-induced losses, synthetic dataset generation, and regression-based predictive modeling, forming the technical foundation for the proposed approach. The central objective of this paper is to determine how effectively cleaning schedules derived from simulated model outputs replicate those obtained from real operational data, thereby assessing the decision-level reliability and practical suitability of synthetic datasets for PV predictive maintenance.
2. Literature Review
Environmental soiling is recognized as a major determinant of PV system performance, exhibiting strong dependence on regional climatic conditions, atmospheric particulate concentration, and seasonal variability. Empirical studies report energy losses ranging from approximately 2% in temperate regions to over 50% in arid, dust-prone environments [1,2,3,4]. Conventional maintenance strategies, such as fixed-interval cleaning or reactive inspections following observed performance drops, fail to account for the stochastic and site-dependent nature of soiling, leading to either excessive operational costs or sustained energy yield reductions [5,6,7]. Consequently, predictive maintenance frameworks have been proposed that leverage either high-resolution operational measurements or synthetic datasets generated from physics-based or empirical PV models to determine optimal cleaning schedules [5,6,7,8,9,10,11,12].
Notably, while some studies employ physics-based simulation frameworks [8,9,10,11,12], the present work generates simulated datasets using statistical regression models trained on real operational data, referred to as "regression-based simulated data" throughout. The generation of high-fidelity synthetic PV datasets has become a critical tool for predictive maintenance. Advanced simulation frameworks integrate irradiance transposition, spectral mismatch correction, dynamic thermal modeling, soiling accumulation kinetics, and diode-based IV curve representations to produce realistic time-series data for DC power, current, and voltage [8,9,10,11,12]. These synthetic datasets facilitate the training of predictive algorithms, scenario-based analyses, and uncertainty quantification, particularly in environments where field measurements are sparse, incomplete, or economically prohibitive [13,14,15]. Validation studies indicate that well-calibrated simulation frameworks can reproduce 90–98% of observed PV performance metrics, demonstrating their suitability for operational decision support [16,17,18,19,20].
Statistical regression and data-driven modeling techniques have been extensively applied to capture the relationships between environmental variables and PV electrical outputs. Linear and regularized regression approaches, including Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression, are particularly effective in addressing multicollinearity, high-dimensional predictor spaces, and measurement noise [21,22,23,24,25,26,27,28,29,30]. PLSR and PCR reduce dimensionality through latent-variable projection; Ridge and Lasso employ L2 and L1 regularization to enhance parameter stability and sparsity; Elastic Net combines both penalties; and Robust Regression mitigates the influence of outliers caused by transient events such as partial shading, inverter clipping, or sensor anomalies [21,22,23,24,25,26,27,28,29,30]. When applied to synthetic datasets, these methods enable robust prediction of clean-condition electrical baselines, performance degradation trends, and soiling-induced energy losses [31,32,33,34].
Recent developments in hybrid modeling approaches, which integrate physical knowledge with data-driven techniques, have further improved the generalizability of PV performance models across different module technologies, system configurations, and climatic zones [35,36,37,38]. Moreover, advanced simulation frameworks now capture detailed electrical characteristics, including nonlinear IV behavior, temperature-dependent resistive losses, and spectral response variability, enabling more realistic synthetic datasets for predictive modeling [2,5,11,27].
Despite these advancements, comprehensive comparative evaluations of multiple regression methodologies specifically for PV cleaning schedule prediction remain scarce. Existing studies predominantly focus either on continuous electrical parameters or on binary cleaning decisions, rarely considering both simultaneously [6,15,26,39]. This limitation constrains the assessment of how accurately synthetic datasets can replicate real operational performance and support adaptive maintenance strategies. The present study addresses this gap by systematically evaluating a suite of regression approaches, including PLSR, PCR, Ridge, Lasso, Elastic Net, and Robust Regression, using both real and synthetic PV datasets. Performance is quantified through correlation coefficients (R), coefficients of determination (R²), mean absolute deviations, and binary classification metrics, including accuracy, precision, recall, and F1-score [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54]. This framework enables a rigorous comparison of regression methodologies, assessing their capability to generate reliable clean-condition baselines, estimate soiling-induced energy losses, and inform data-driven PV cleaning schedules under variable environmental and operational conditions.
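As an illustration of the evaluation protocol described above, the sketch below computes the continuous metrics (R, R², mean absolute deviation) and the binary classification metrics (accuracy, precision, recall, F1-score) with scikit-learn. All numbers are small synthetic stand-ins, not values from the Shams Centre dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             precision_score, r2_score, recall_score)

# Hypothetical measured vs. predicted clean-condition power (kW)
y_true = np.array([10.2, 12.5, 8.3, 15.0, 11.1])
y_pred = np.array([10.0, 12.9, 8.1, 14.6, 11.4])

corr = np.corrcoef(y_true, y_pred)[0, 1]   # correlation coefficient R
r2 = r2_score(y_true, y_pred)              # coefficient of determination R^2
mae = mean_absolute_error(y_true, y_pred)  # mean absolute deviation

# Hypothetical cleaning flags: real-data schedule vs. simulated-data schedule
flags_real = np.array([0, 1, 1, 0, 1, 0])
flags_sim = np.array([0, 1, 0, 0, 1, 0])
acc = accuracy_score(flags_real, flags_sim)
prec = precision_score(flags_real, flags_sim)
rec = recall_score(flags_real, flags_sim)
f1 = f1_score(flags_real, flags_sim)
```

Comparing the two flag series at the decision level, rather than only the power signals, is exactly the decision-level reliability question posed in the Introduction.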
Several recent studies have explored forecasting methods for solar generation. The study [55] presented a comprehensive analysis of predictive models for forecasting solar generation in microgrids, evaluating multiple methodologies through three error metrics and benchmarking against average base values. However, that study focused exclusively on microgrid systems, which limits its applicability to larger PV installations.
Another forecasting method is discussed in [56], which is based on an exponential smoothing approach, with a comparative evaluation of four exponential smoothing methods presented. The results demonstrate the importance of tailoring forecasting models to the specific temporal structure of PV power. In [57], the authors analyze and evaluate the performance of irradiance forecasting models, using statistical indicators to compare predicted output with actual output from experimental trials. However, humidity and wind speed were not included as predictors in that study, a limitation that the present work addresses through comprehensive environmental feature integration.
3. Methodology
This study adopts a two-stage predictive maintenance framework to rigorously evaluate whether statistically simulated Photovoltaic (PV) datasets can reliably substitute for real operational measurements in solar panel cleaning schedule prediction. The objectives of the proposed framework are to preserve physical interpretability, ensure statistical consistency, and enable decision-level comparability between real-data-driven and regression-based simulation-driven maintenance strategies.
The proposed workflow integrates two sequential and tightly coupled stages:
In Stage 1, multiple validated statistical regression models are employed to estimate the clean-condition DC power baseline of the PV system using environmental and operational inputs [50,51,52]. These models learn the functional relationship between meteorological variables (e.g., plane-of-array irradiance, module and ambient temperatures, wind speed, and humidity) and the electrical output of a continuously cleaned reference string (see Section 4.1). The resulting regression outputs represent idealized clean-condition power trajectories that are subsequently used to quantify soiling-induced performance losses. By comparing predicted clean power with measured soiled power, a Soiling Loss Index (SLI) is computed, forming a physically meaningful indicator of surface contamination and degradation.
In Stage 2, the problem is reformulated as a binary classification task, where the objective is to determine whether cleaning is required at a given time step. A supervised machine learning classifier is trained using features derived from the SLI, rolling statistical descriptors of soiling evolution, environmental interactions, and temporal variables. The classifier outputs a Cleaning Demand Flag (CDF), translating continuous power degradation estimates into actionable maintenance decisions (see Section 4.2). Importantly, this classification stage is executed under two parallel paradigms:
- (i) Features derived from real measured clean-power data;
- (ii) Features derived from regression-based simulated clean-power outputs.
This parallel evaluation enables a direct and controlled comparison of cleaning schedules generated from real versus regression-based simulated datasets.
The two-stage architecture offers several advantages. First, it decouples physical power modeling from decision-making logic, allowing each stage to be independently optimized and validated. Second, it ensures that any discrepancies between real and regression-based simulated data are evaluated not only at the signal level but also at the operational decision level, which is the ultimate concern for maintenance planning. Finally, the framework supports scalability and transferability, making it suitable for PV systems operating under data-scarce or sensor-limited conditions.
A schematic overview of the complete methodology is provided in Figure 1, illustrating the flow from data acquisition and preprocessing through regression-based power modeling, soiling loss estimation, and the machine-learning-based cleaning decision framework, as well as the comparative validation between real-data and simulation-based paradigms.
The following section describes the modeling approaches employed in both stages of the proposed framework, including the statistical regression models and training protocols used for clean-condition power prediction in Stage 1 and the machine learning–based classifiers applied in Stage 2 for soiling loss interpretation and the cleaning decision framework.
4. Dataset and Preprocessing
The dataset examined in this study was recorded at the Shams Centre, a solar energy research station operated by the German University of Technology in Oman (GUtech). Data collection was carried out using a network of sensors and monitoring points that ensures data quality and high granularity, making the dataset an excellent basis for machine learning–based predictive cleaning models.
To guide the reader through the dataset description, Table 1 provides a concise summary of the key characteristics of the Shams Centre dataset. Following this overview, Section 4.1, Section 4.2, Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7 detail the data sources, simulation framework, preprocessing steps, feature engineering, and data provenance.
4.1. Data Sources
This study relies on high-resolution, real-world operational measurements collected from the 30 kW Shams Centre Photovoltaic (PV) research facility at the German University of Technology in Oman. The dataset provides detailed electrical and environmental records obtained under actual operating conditions, forming the primary empirical foundation for model development, training, validation, and performance assessment. The measurement infrastructure, sensor specifications, data acquisition procedures, and data quality assurance protocols are described comprehensively in [50,51], ensuring the reliability and integrity of the dataset.
The 5 min resolution dataset encompasses over 235,000 multivariate time-series observations. Beyond the variables summarized in Table 1, it includes validated logs of both manual and automated cleaning activities. This precise temporal alignment between environmental conditions, soiling accumulation, and operational interventions is critical for accurate model training and evaluation.
While these high-quality measurements provide a rich and reliable foundation, real-world data can be limited by operational gaps, sensor noise, or coverage constraints. To complement the empirical dataset and enable systematic, controlled assessment of model performance, the study also incorporates simulated clean-power datasets, described in the following subsection. These simulations provide baseline clean-condition power outputs free from soiling effects, allowing direct comparison with measured data and evaluation of the predictive capability of the PV cleaning models.
4.2. Simulated Clean-Power Datasets
To complement real-world measurements and provide a controlled baseline for model development, this study employs regression-based simulations to generate clean-condition PV power outputs. It is important to clarify that these are not physics-based simulations. Rather, they are statistically generated datasets produced by regression models trained on real operational measurements from the Shams Centre facility.
The regression-based simulation framework leverages a set of carefully selected statistical models: Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression. These models were chosen for their ability to capture linear relationships, apply regularization, and tolerate noise while handling the multicollinearity and skewed distributions characteristic of PV operational data.
For each model, the learned relationship between environmental inputs (irradiance, temperature, humidity, wind speed) and clean-power output is used to generate synthetic clean-power values under the same environmental conditions. These regression-based simulated datasets represent the expected PV system behavior under clean conditions as learned from historical data, providing a theoretical reference signal for quantifying soiling-induced losses.
The key distinction is therefore as follows:
- Real data: directly measured from sensors;
- Regression-based simulated data: generated by statistical models trained on real data;
- Physics-based simulated data: generated by physical equations (not used in this study).
These regression-based simulated baselines provide an essential point of comparison for real measurements, enabling the identification of deviations caused by soiling, environmental variability, or operational uncertainties. By removing the effects of soiling and operational noise, the simulations allow for a controlled assessment of model performance and the evaluation of predictive strategies under ideal conditions, while also enabling systematic comparison of cleaning decisions derived from real versus regression-based simulated data.
4.3. Data Cleaning and Temporal Alignment
To ensure data reliability and consistency, all datasets, both real and simulated, were synchronized to a unified timestamp format. Invalid or physically implausible values were removed, and temporal consistency was verified using diurnal voltage and power profiles. This process ensured that electrical, meteorological, and cleaning-log records were fully aligned and structurally consistent, forming a robust foundation for subsequent feature extraction and predictive modeling.
4.3.1. Initial Data Inspection and Correlation Analysis
Initial data inspection and pre-processing included timestamp alignment, correlation analysis, and numerical integrity checks. After standardizing timestamps and aligning all records to a single datetime index, additional temporal features were extracted to capture daily and operational patterns. A correlation heatmap (Figure 2) revealed strong positive relationships between irradiance, temperature, and power, alongside negative correlations with soiling variables. These insights guided feature selection and informed the design of predictive models by highlighting the primary drivers of PV system performance.
4.3.2. Temporal Standardization and Numerical Integrity Checks
Time-based features capturing diurnal and seasonal variations were derived from the unified timeline, including hour, minute, day of the week, day of the year, and a weekend flag. The accuracy of this temporal alignment was confirmed using voltage–time curves (Figure 3), which displayed the expected daily trends: a gradual morning rise, a stable midday peak, and a slow afternoon decline.
All numerical fields were standardized and checked to ensure realistic measurements. Irradiance values were confirmed to be non-negative, DC power remained non-zero during daylight hours (except at sunrise or sunset), temperatures fell within realistic climatic ranges, and voltage/current readings showed smooth transitions without abrupt spikes. No data points required removal, highlighting the robustness of the measurement infrastructure. Additional validation compared DC power distributions against irradiance patterns, confirming consistency across the unified timeline (Figure 4).
Finally, feature scaling was applied so that all variables contribute comparably to model training. The following subsection details the standardization of continuous variables and the normalization of bounded or categorical features, producing a harmonized dataset ready for robust model development.
4.4. Scaling and Normalization
With the datasets now clean, temporally aligned, and verified for numerical integrity, the next step is preparing features for modeling. This involves transforming both raw and engineered variables to ensure consistency in scale and suitability for machine learning algorithms. Proper scaling prevents variables with larger magnitudes from dominating model training, improving convergence, stability, and predictive performance.
Continuous variables, including temperatures, irradiance, DC and AC power, and other electrical outputs, were standardized using z-score normalization, centering each variable at zero with unit standard deviation. This ensures that differences in magnitude do not bias the learning process and allows models to weigh each feature appropriately.
Bounded or categorical variables, such as soiling indicators, encoded time-based features (hour of the day, day of the week, weekend flags), and operational markers, were normalized using Min–Max scaling, transforming all values to a uniform range (typically 0–1) while maintaining relative relationships.
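A minimal sketch of this two-track preprocessing with scikit-learn, using small hypothetical values in place of the actual sensor channels:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical continuous channels: module temperature (deg C), DC power (kW)
continuous = np.array([[35.0, 12.1], [41.2, 18.4], [28.9, 6.3], [44.5, 21.0]])
# Hypothetical bounded features: hour of day, weekend flag
bounded = np.array([[9.0, 0.0], [12.0, 0.0], [15.0, 1.0], [18.0, 1.0]])

z = StandardScaler().fit_transform(continuous)  # z-score: zero mean, unit variance per column
mm = MinMaxScaler().fit_transform(bounded)      # Min-Max: each column rescaled to [0, 1]

model_ready = np.hstack([z, mm])                # harmonized feature matrix
```

In practice the scalers would be fitted on the training split only and then applied to validation and test data, so that no information leaks across splits.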
Figure 5 illustrates the integration of both standardization and normalization across the dataset, combining primary sensor measurements with engineered features.
The resulting dataset is consistent, harmonized, and model-ready, providing a robust foundation for PV cleaning prediction. By standardizing and normalizing both raw and derived features, the models can effectively learn relationships between environmental conditions, electrical outputs, soiling effects, and cleaning requirements.
This fully prepared dataset now allows for a comprehensive statistical summary, which characterizes the distributions, variability, and interrelationships of all cleaned and normalized variables, providing essential insights prior to model development.
4.5. Statistical Summary of the Cleaned Dataset
With the datasets now clean, temporally aligned, and properly scaled, it is essential to quantitatively evaluate their distributions, variability, and interrelationships before proceeding to model development. This statistical overview ensures that all features, both raw and engineered, are well-characterized and suitable for learning PV cleaning patterns. It also provides insights that guide the selection and construction of additional predictive features.
A comprehensive descriptive summary of the cleaned dataset is presented in Table 2, Table 3 and Table 4. These statistics provide insight into central tendency, variability, and operating ranges across electrical, environmental, and soiling-related variables. The summary supports validation of data integrity, assessment of sensor consistency, and verification of assumptions applied during model development. The presence of realistic ranges, appropriate scaling, and physically consistent relationships indicate that the cleaned dataset is suitable for downstream feature engineering and machine learning tasks.
Correlation analysis revealed strong positive correlations between irradiance and power outputs, while soiling indicators showed negative correlations with electrical performance. Time-based features, including hour of the day, day of the week, and seasonal markers, displayed clear daily and seasonal patterns in PV behavior, validating their inclusion as engineered predictors for cleaning schedule models.
This statistical characterization establishes a quantitative foundation for feature engineering, highlighting which variables carry meaningful signals and how they relate to one another. With this understanding, the study can now proceed to the construction and optimization of features that capture temporal dependencies, environmental influences, and soiling effects, producing a dataset ready for robust predictive modeling.
4.6. Feature Engineering
To improve the model's ability to predict cleaning needs, we engineered a set of features designed to capture non-linear PV behavior, environmental interactions, and the temporal dynamics of soiling progression. These features are organized into five categories that address distinct aspects of PV performance and degradation. Table 5 summarizes these categories and their purpose before the detailed presentation in Section 4.6.1, Section 4.6.2, Section 4.6.3, Section 4.6.4 and Section 4.6.5.
4.6.1. Environmental Interaction Features
PV performance is driven by the combined, non-linear interaction of multiple environmental factors. To capture these coupled effects, we derived three interaction features based on domain knowledge:
i. Module–Ambient Temperature Difference (ΔT):
This indicates the module’s heat dissipation efficiency. Higher values correspond to reduced convective cooling and increased voltage drop, capturing thermal stress beyond absolute temperatures.
ii. Irradiance–Temperature Interaction:
This interaction term represents the competing effects of high irradiance, which boosts current, and high temperature, which reduces voltage, a relationship characteristic of desert environments.
iii. Humidity–Wind Interaction:
This combined feature captures the net effect of humidity, which promotes particle adhesion, and wind speed, which enhances cooling and can mechanically remove dust, on both soiling dynamics and thermal management.
4.6.2. Performance Normalization and Degradation Indicators
To isolate soiling effects from variations in solar resource, we normalized PV output by incident irradiance:

P_norm = P_DC / G_POA.

Under clean conditions, this ratio remains relatively stable for a given module temperature range; significant deviations indicate performance degradation attributable to soiling, module aging, or other non-radiative losses.
To further discriminate between soiling and electrical faults, we monitored changes in DC voltage and current relative to expected values. Current reductions typically signal soiling or shading, as accumulated dust primarily impedes light transmission to the cells. Voltage drops, in contrast, may indicate increased series resistance, bypass diode activation, or other electrical issues. Tracking both parameters enables the model to distinguish between different failure modes and improves the specificity of cleaning predictions.
4.6.3. Soiling Metrics and Rolling Statistics
Soiling accumulation was quantified using the Soiling Loss Index (SLI) relative to a continuously cleaned reference string:

SLI_i(t) = 1 − P_i(t) / P_clean(t),

where P_i(t) is the power of string i and P_clean(t) is the clean reference. A timestamp is labeled as requiring cleaning (CDF = 1) if SLI_i(t) exceeds a predefined threshold, indicating significant power loss.
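A minimal sketch of the SLI computation and the thresholded cleaning flag. The power values and the 5% threshold below are illustrative stand-ins, not the calibrated values used in the study:

```python
import numpy as np

def soiling_loss_index(p_string, p_clean_ref):
    """SLI = 1 - P_string / P_clean_ref, clipped to the physical range [0, 1]."""
    return np.clip(1.0 - np.asarray(p_string) / np.asarray(p_clean_ref), 0.0, 1.0)

# Hypothetical midday powers (kW) for a soiled string and the clean reference
p_clean = np.array([20.0, 20.5, 19.8, 21.0])
p_soiled = np.array([19.6, 19.2, 17.8, 18.1])
tau = 0.05  # illustrative threshold, not the paper's calibrated value

sli = soiling_loss_index(p_soiled, p_clean)
cleaning_flag = (sli > tau).astype(int)  # 1 = cleaning required at this timestamp
```

Clipping keeps the index physically interpretable when measurement noise briefly pushes the soiled string above the reference.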
To capture the temporal dynamics of soiling, we computed rolling statistics over multiple time windows. For a series x_t and a window size w, the moving average is

MA_t = (1/w) Σ_{i=0}^{w−1} x_{t−i}.
The following rolling statistics were computed for the SLI over 1 h, 3 h, and 24 h windows:
- Mean SLI: Captures short- and long-term soiling trends, smoothing transient fluctuations caused by passing clouds or sensor noise.
- Standard deviation of SLI: Quantifies variability in soiling levels, helping detect sudden dust deposition events from sandstorms or partial cleaning from light rainfall.
- Rate of change (ΔSLI/Δt): First derivative of the SLI time series, which identifies rapid soiling events requiring immediate intervention, such as those following a major dust storm.
A smoothed 24 h moving average was also applied to filter out short-term noise and highlight gradual accumulation patterns, enabling the model to distinguish between slow, continuous soiling buildup and abrupt changes requiring urgent maintenance action.
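The rolling descriptors above can be sketched with pandas at the dataset's 5 min resolution; the SLI series here is a synthetic two-day ramp with noise, standing in for real measurements:

```python
import numpy as np
import pandas as pd

# Hypothetical two-day SLI series at 5 min resolution (576 samples)
idx = pd.date_range("2024-03-01", periods=576, freq="5min")
rng = np.random.default_rng(1)
sli = pd.Series(np.clip(np.linspace(0.0, 0.08, 576) + rng.normal(0, 0.005, 576), 0, 1),
                index=idx)

windows = {"1h": 12, "3h": 36, "24h": 288}  # samples per window at 5 min steps
feats = pd.DataFrame(index=idx)
for label, w in windows.items():
    feats[f"sli_mean_{label}"] = sli.rolling(w, min_periods=1).mean()
    feats[f"sli_std_{label}"] = sli.rolling(w, min_periods=1).std()
feats["sli_rate"] = sli.diff() / (5.0 / 60.0)  # delta-SLI per hour
```

The 24 h mean doubles as the smoothed series used to separate gradual accumulation from abrupt events.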
4.6.4. Temporal and Cyclical Features
PV performance and soiling accumulation exhibit strong temporal dependencies driven by daily solar cycles, seasonal weather patterns, and maintenance history. To enable the model to learn these periodic behaviors without imposing arbitrary discontinuities, temporal variables were encoded using sine and cosine transformations. For a given cyclic variable x with period T, the transformations are

x_sin = sin(2πx/T), x_cos = cos(2πx/T).

This encoding preserves the natural continuity of cyclical features; for example, hour 23 and hour 0 are adjacent on a circle, allowing the model to recognize smooth daily transitions in irradiance and temperature. The following temporal features were encoded using this approach:
- Hour of the day (T = 24 h): Captures diurnal patterns in solar irradiance, module temperature, and soiling visibility.
- Day of the year (T = 365 days): Models seasonal variations in solar elevation, ambient temperature, and dust accumulation rates.
- Month of the year (T = 12 months): Provides a coarser seasonal indicator that complements day-of-year encoding.
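The three encodings above can be sketched with a single helper; the sample values are illustrative:

```python
import numpy as np

def cyclical_encode(x, period):
    """Project a cyclic variable onto the unit circle so period boundaries meet."""
    angle = 2.0 * np.pi * np.asarray(x, dtype=float) / period
    return np.sin(angle), np.cos(angle)

h_sin, h_cos = cyclical_encode([0, 6, 12, 23], 24)  # hour of day, T = 24
d_sin, d_cos = cyclical_encode([1, 180, 365], 365)  # day of year, T = 365
m_sin, m_cos = cyclical_encode([1, 6, 12], 12)      # month of year, T = 12

# Hour 23 and hour 0 are neighbours on the circle despite the numeric jump
gap = float(np.hypot(h_sin[3] - h_sin[0], h_cos[3] - h_cos[0]))
```

The small Euclidean gap between hour 23 and hour 0 is what lets a model treat midnight as a smooth transition rather than a discontinuity.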
4.6.5. Final Feature Set for LightGBM
In addition to cyclical encodings, we engineered a feature representing the time since the last cleaning or rainfall event. This variable captures the influence of recent maintenance or natural washing on current soiling levels: longer elapsed times generally correlate with higher soiling accumulation, while recent events temporarily restore clean conditions. This feature enables the model to account for the resetting effect of cleaning interventions and rainfall when predicting current soiling losses.
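One way to derive this elapsed-time feature with pandas is sketched below; the 6-hourly timeline and event positions are hypothetical:

```python
import pandas as pd

# Hypothetical 6-hourly timeline with two reset events (cleaning, rainfall)
idx = pd.date_range("2024-03-01", periods=10, freq="6h")
events = pd.Series(0, index=idx)
events.iloc[2] = 1  # documented cleaning event
events.iloc[7] = 1  # rainfall event

# Timestamp of the most recent reset event, carried forward in time
last_event = pd.Series(pd.NaT, index=idx)
last_event[events == 1] = last_event.index[events == 1]
last_event = last_event.ffill()

# Elapsed hours since the last cleaning/rain; NaN before any recorded event
hours_since = (pd.Series(idx, index=idx) - last_event).dt.total_seconds() / 3600.0
```

Each event resets the counter to zero, reproducing the "washing resets soiling" behavior the feature is meant to encode.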
4.7. Data Provenance and Simulation Boundaries
To ensure complete transparency in our two-stage framework, this subsection explicitly defines the provenance of all data and confirms the independence of the two modeling stages.
4.7.1. Purely Measured Data
The following data types originate exclusively from physical sensors at the Shams Centre facility and are never simulated or algorithmically modified:
- Environmental measurements: plane-of-array irradiance, module temperature, ambient temperature, wind speed, and relative humidity;
- Electrical measurements from soiled PV strings: DC power, DC current, and DC voltage (recorded continuously regardless of soiling state);
- Clean-period electrical measurements: DC power, current, and voltage recorded during confirmed clean periods (used exclusively for training);
- Operational logs: manually recorded cleaning events and automated cleaning system logs;
- Meteorological events: rainfall measurements.
These measured data serve as the immutable foundation for all model training and validation.
4.7.2. Training Data for Stage 1 Regression Models
The regression models in Stage 1 (PLSR, PCR, Ridge, Lasso, Elastic Net, Robust) are trained exclusively on a subset of purely measured data: specifically, time periods identified from cleaning logs when PV modules were confirmed clean. These clean periods are defined as the 24–48 h window immediately following a documented cleaning event, before significant soiling re-accumulation occurs.
During these clean periods, the measured DC power represents the true clean-condition baseline for that specific environmental context. This clean-period data forms the training set for all regression models. No simulated data of any kind is used during this training phase.
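A sketch of how such post-cleaning windows could be selected as the Stage 1 training set. The hourly timeline, event times, and the 36 h window duration are illustrative choices within the 24–48 h range stated above:

```python
import pandas as pd

def clean_period_mask(index, cleaning_times, window="36h"):
    """Mark timestamps inside a post-cleaning window (24-48 h in the paper;
    36 h here as an illustrative choice)."""
    mask = pd.Series(False, index=index)
    horizon = pd.Timedelta(window)
    for t in cleaning_times:
        mask |= (index >= t) & (index <= t + horizon)
    return mask

idx = pd.date_range("2024-03-01", periods=7 * 24, freq="h")  # one week, hourly
cleanings = [pd.Timestamp("2024-03-02 08:00"), pd.Timestamp("2024-03-05 08:00")]
mask = clean_period_mask(idx, cleanings)
# X_train, y_train = X[mask], y[mask]   # only clean-period rows train Stage 1
```

Rows outside the mask never enter regression training, which is what guarantees the models learn clean-condition behavior only.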
4.7.3. Regression-Derived Simulated Data
The term "regression-based simulated data" refers strictly to the output of the trained regression models when applied to any input timeframe. Mathematically, for any timestamp t with measured environmental inputs x_t, the simulated clean-condition power is

P̂_clean(t) = f(x_t),

where f represents any of the trained regression models. This value P̂_clean(t) is what we term "regression-based simulated data": it is a statistical inference of what the power would be under clean conditions, generated by a model trained on historical clean-period measurements. It is not derived from physical soiling equations and is not a direct measurement.
Crucially, these simulated values are generated only after all Stage 1 models are fully trained and fixed.
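The train-once-then-freeze procedure can be sketched as follows, with Ridge standing in for any of the six models and all data being synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
# Synthetic stand-ins: columns = irradiance, module temp, ambient temp, wind, humidity
X_all = rng.uniform([0, 15, 20, 0, 10], [1000, 60, 45, 12, 90], size=(1000, 5))
y_all = 0.02 * X_all[:, 0] + rng.normal(0, 0.3, 1000)  # toy measured DC power (kW)
clean_mask = rng.random(1000) < 0.2  # stand-in for documented clean periods

# Stage 1: fit once on clean-period measurements only, then freeze the model
model = Ridge(alpha=1.0).fit(X_all[clean_mask], y_all[clean_mask])

# Regression-based simulated data: P_hat_clean(t) = f(x_t) over the full timeline;
# nothing downstream (Stage 2) ever updates this fitted model
p_hat_clean = model.predict(X_all)
```

Because `model` is never refitted after this point, any divergence between `p_hat_clean` and measured power downstream reflects soiling or model fidelity, not iterative adaptation.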
4.7.4. Summary Table: Data Categories
Table 6 summarizes the data categories, their sources, usage, and whether they are simulated, providing a quick reference for understanding data provenance throughout the framework.
4.7.5. Independence of Stages and Absence of Feedback Loops
A critical design feature of our framework is the complete independence of the two stages. There is no feedback loop between Stage 2 and Stage 1. The regression models in Stage 1 are trained once on historical clean-period data and remain static during Stage 2 classification. The simulated clean-power outputs generated by Stage 1 serve as fixed inputs to Stage 2, but classification results do not retrain, update, or otherwise influence the regression models.
This unidirectional flow is illustrated schematically in
Figure 6. The separation ensures experimental integrity when comparing cleaning decisions derived from real versus simulated data, as any differences can be attributed solely to the fidelity of the simulated clean-power estimates rather than to iterative model adaptation.
5. Methods
5.1. Stage 1: Regression Models for Clean-Power Prediction
5.1.1. Model Selection
The regression models employed in this study fall into three complementary categories: latent-variable regression (PLSR, PCR), regularized linear regression (Ridge, Lasso, Elastic Net), and outlier-resistant regression (Robust). A complete description of each model’s characteristics and selection rationale is provided in
Table 7. The real operational dataset from the Shams Solar Site is primarily characterized by two statistical properties: pronounced multicollinearity among input features and skewed-normal distributions in both environmental and electrical variables [
50]. To reliably model PV performance under these conditions, all regression techniques were developed, evaluated, and analyzed within a unified modeling framework [
50,
51,
52]. To ensure fair comparison and predictive consistency, all models were trained using a common supervised learning protocol, identical feature sets, and standardized preprocessing and validation procedures. This unified training strategy ensures that observed performance differences arise from intrinsic model characteristics rather than data handling or training inconsistencies.
5.1.2. Training Protocol
All regression models were trained using a unified supervised learning framework to ensure consistency and fair comparison. The framework employed the input features and target variable listed in
Table 2, with >235,000 observations split chronologically into training (70%), validation (15%), and testing (15%) sets to preserve temporal dependencies.
Z-score normalization was applied to all continuous features to avoid scale dominance:
$$z_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j},$$
where the following definitions are used:
$x_{i,j}$: Original value of feature $j$ for observation $i$ (e.g., irradiance, module temperature);
$\mu_j$: Mean of feature $j$ across all observations;
$\sigma_j$: Standard deviation of feature $j$ across all observations;
$z_{i,j}$: Standardized value of feature $j$ for observation $i$.
Hyperparameter optimization was performed using 5-fold time-series cross-validation on the training set with Mean Squared Error (MSE) as the optimization metric. Iterative models were trained until the convergence tolerance was satisfied or a maximum of 1000 iterations was reached.
Table 8 provides a complete summary of the training protocol.
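The protocol above (chronological 70/15/15 split, z-score scaling, 5-fold time-series cross-validation with MSE) can be sketched as follows. Synthetic data are used, Lasso is shown as one representative of the six models, and the alpha grid is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the chronologically ordered feature matrix.
n = 1000
rng = np.random.default_rng(1)
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Chronological 70/15/15 split (no shuffling, preserving temporal dependencies).
i_tr, i_va = int(0.70 * n), int(0.85 * n)
X_tr, y_tr = X[:i_tr], y[:i_tr]
X_va, y_va = X[i_tr:i_va], y[i_tr:i_va]
X_te, y_te = X[i_va:], y[i_va:]

# Z-score normalization inside the pipeline prevents leakage across CV folds;
# 5-fold time-series CV with MSE as the criterion, capped at 1000 iterations.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=1000))
search = GridSearchCV(
    pipe,
    {"lasso__alpha": [0.001, 0.01, 0.1]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
).fit(X_tr, y_tr)
```

Placing the scaler inside the pipeline ensures that $\mu_j$ and $\sigma_j$ are recomputed on each training fold only, which is the main point of a unified preprocessing protocol.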
5.2. Stage 2: Machine Learning-Based Cleaning Classification
The cleaning decision problem was formulated as a supervised binary classification task, where the objective is to predict the Cleaning Demand Flag (CDF) from high-resolution PV operational data. Light Gradient Boosting Machine (LightGBM) was selected due to its ability to efficiently handle large, time-ordered datasets while capturing nonlinear relationships and high-order interactions among environmental conditions, soiling accumulation, and PV performance degradation. Its ensemble-based, gradient boosting architecture provides robust modeling of both gradual performance decline and infrequent cleaning events, while built-in regularization and weighted loss functions address overfitting and class imbalance.
The model was trained to predict the binary CDF using the engineered feature set summarized in
Table 9. This feature set integrates primary performance metrics (clean power, soiled power, Soiling Loss Index), temporal and cyclical variables (cyclically encoded hour, day-of-year, month, and time since last cleaning or rainfall), time-based soiling statistics (rolling means, standard deviations, and rate of change in SLI over 1 h, 3 h, and 24 h windows), environmental interaction terms (e.g., ΔT and G × T), and power indicators (voltage and current changes).
To ensure temporal consistency and prevent information leakage, the dataset was split chronologically into 70% training, 15% validation, and 15% testing subsets. Model hyperparameters were optimized using Bayesian optimization combined with 5-fold time-series cross-validation on the training data. The final optimized model employed 64 leaves, a learning rate of 0.05, a maximum tree depth of 10, a minimum of 50 samples per leaf, a feature subsampling ratio of 0.8, and L1 and L2 regularization coefficients set to 1.0. Class imbalance, where cleaning events represent a minority of observations, was addressed through cost-sensitive learning by assigning higher misclassification penalties to the minority cleaning class.
The complete cleaning decision framework was evaluated under two parallel paradigms:
Paradigm A: Features derived from real measured PV power data.
Paradigm B: Features derived from regression-based simulated clean-power outputs.
This design enables direct assessment of whether cleaning schedules generated using simulated power data remain consistent with those obtained from real measurements, while maintaining LightGBM as a common decision-making engine. Model performance was evaluated using standard classification metrics (accuracy, precision, recall, F1-score, AUC), temporal alignment via Mean Absolute Time Error (MATE), operational concordance with ground truth cleaning events, and feature importance analysis.
5.3. Evaluation Metrics
5.3.1. Soiling Loss Index and Cleaning Demand Flag
The predicted clean-power values are then used to compute the Soiling Loss Index (SLI):
$$\mathrm{SLI}(t) = \frac{\hat{P}_{\text{clean}}(t) - P_{\text{soiled}}(t)}{\hat{P}_{\text{clean}}(t)},$$
where $P_{\text{soiled}}(t)$ denotes the measured soiled DC power at the same timestamp.
The SLI provides a physically meaningful, normalized indicator of contamination severity that is independent of absolute power levels and ambient conditions.
A binary Cleaning Demand Flag (CDF) was derived by applying a fixed 5% power loss threshold to the SLI. This threshold is widely adopted in the literature [
35,
36,
41] as it represents a point where the cost of cleaning is typically offset by the value of recovered energy, while remaining above typical sensor uncertainty [
19,
50].
The CDF is therefore defined as
$$\mathrm{CDF}(t) = \begin{cases} 1, & \mathrm{SLI}(t) \geq 0.05, \\ 0, & \text{otherwise}. \end{cases}$$
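In code, the SLI and the thresholded CDF can be sketched as below. The ratio form of the SLI is assumed from its description as a normalized, level-independent loss indicator; the function names are ours.

```python
import numpy as np

def soiling_loss_index(p_clean_pred, p_soiled):
    """SLI as the fractional power loss relative to the predicted clean baseline
    (assumed form: (P_clean_pred - P_soiled) / P_clean_pred)."""
    p_clean_pred = np.asarray(p_clean_pred, dtype=float)
    p_soiled = np.asarray(p_soiled, dtype=float)
    return (p_clean_pred - p_soiled) / p_clean_pred

def cleaning_demand_flag(sli, threshold=0.05):
    """Binary CDF: 1 when the soiling loss meets or exceeds the 5% threshold."""
    return (np.asarray(sli) >= threshold).astype(int)
```

For example, a 7% loss raises the flag while a 2% loss does not, matching the fixed 5% decision rule.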
5.3.2. Regression Model Performance Metrics
Model performance in Stage 1 was assessed using the coefficient of determination (R²), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Percentage Error (MAPE). Mathematical definitions for all regression metrics are provided in
Appendix A (
Appendix A.1 and
Appendix A.2).
To evaluate the trade-off between accuracy and model complexity, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were also computed [
52,
53]. These information criteria enable direct comparison of predictive accuracy and model parsimony across different regression approaches, accounting for differences in the number of model parameters. Lower AIC and BIC values indicate a more favorable balance between goodness of fit and model simplicity. Mathematical formulations for AIC and BIC are provided in
Appendix A (
Appendix A.3).
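A common Gaussian-likelihood formulation of these criteria is sketched below; the paper's exact variant is the one given in Appendix A.3, so treat this as an illustrative assumption.

```python
import numpy as np

def aic_bic(y_true, y_pred, n_params):
    """AIC and BIC under a Gaussian likelihood:
    AIC = n * ln(RSS / n) + 2k,   BIC = n * ln(RSS / n) + k * ln(n)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rss = float(np.sum((y_true - y_pred) ** 2))
    aic = n * np.log(rss / n) + 2 * n_params
    bic = n * np.log(rss / n) + n_params * np.log(n)
    return aic, bic
```

Note that for more than about seven observations, ln(n) > 2, so BIC penalizes extra parameters more heavily than AIC, which is why the two criteria can rank models differently.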
5.3.3. Classification Performance Metrics
Classification performance was evaluated using standard metrics derived from the confusion matrix, including accuracy, precision, recall, and F1-score [
54]. A fixed probability threshold of 0.5 was applied to convert the model’s probabilistic outputs into binary cleaning decisions. Mathematical definitions for all classification metrics are provided in
Appendix A (
Appendix A.4).
In addition to these threshold-based metrics, we evaluated discriminative ability using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC). The ROC curve plots the True Positive Rate against the False Positive Rate across all possible classification thresholds, providing a comprehensive view of the model’s discriminative ability independent of threshold selection [
52,
53]. The AUC summarizes this performance as a single value between 0.5 (random guessing) and 1.0 (perfect classification), representing the probability that a randomly chosen positive instance receives a higher predicted probability than a randomly chosen negative instance [
54].
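The probabilistic interpretation of AUC described above can be verified numerically on synthetic scores (the labels and scores below are randomly generated for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # synthetic binary labels
scores = 0.3 * y + rng.random(200)      # noisy but informative predicted scores

auc = roc_auc_score(y, scores)

# AUC equals the probability that a randomly chosen positive instance
# receives a higher score than a randomly chosen negative instance.
pos, neg = scores[y == 1], scores[y == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))
```

With continuous scores (no ties), the pairwise fraction and the area under the ROC curve coincide, which is the Mann–Whitney equivalence the text invokes.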
5.3.4. Temporal Alignment Evaluation
In addition to point-wise classification accuracy, temporal consistency between predicted and actual cleaning events was evaluated using the Mean Absolute Time Error (MATE). For each predicted cleaning event, the absolute difference between the predicted time $t_i^{\text{pred}}$ and the nearest ground-truth cleaning time $t_i^{\text{true}}$ was computed:
$$\mathrm{MATE} = \frac{1}{N} \sum_{i=1}^{N} \left| t_i^{\text{pred}} - t_i^{\text{true}} \right|,$$
where $N$ denotes the number of matched cleaning events. MATE is expressed in days and reflects the practical impact of prediction errors on maintenance scheduling.
Example: If three cleaning events were predicted 1, 2, and 3 days away from the actual cleaning dates, then
$$\mathrm{MATE} = \frac{1 + 2 + 3}{3} = 2 \text{ days}.$$
Thus, on average, predicted cleaning events deviate by 2 days from actual events, which is a critical consideration for operational planning, as small timing errors can lead to unnecessary cleaning actions or extended energy losses.
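A direct implementation of MATE matching the worked example is sketched below (times are expressed as day numbers; the function name is ours, not from the paper):

```python
import numpy as np

def mean_absolute_time_error(pred_times, true_times):
    """MATE: mean absolute gap (in days) between each predicted cleaning
    event and its nearest ground-truth cleaning event."""
    true_times = np.asarray(true_times, dtype=float)
    return float(np.mean([np.min(np.abs(t - true_times))
                          for t in np.asarray(pred_times, dtype=float)]))
```

For predictions that land 1, 2, and 3 days from the nearest actual cleanings, the function returns 2.0, reproducing the worked example.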
5.3.5. Comparative Evaluation Framework
Classification performance metrics and temporal alignment measures were computed for both paradigms under identical evaluation conditions:
- Paradigm A: Features derived from real measured PV power data;
- Paradigm B: Features derived from regression-based simulated clean-power outputs (generated by statistical models trained on real measurements).
This direct comparison enables the quantification of how deviations in simulated clean-power inputs propagate through the classification stage and affect operational cleaning decisions.
Consistency in accuracy, F1-score, and MATE between the two paradigms indicates that regression-based simulated datasets can effectively replace real measured power data for automated PV cleaning prediction.
All metrics defined in this section are reported in
Section 6 (Results): classification performance in Table 11, regression model performance in Tables 13 and 14, and temporal alignment in
Section 6.1.
5.3.6. Note on Adaptive Thresholds
The choice of a 5% Soiling Loss Index (SLI) threshold in this study was guided by both the prior literature and operational practice. Studies on PV soiling have commonly adopted threshold values in the range of 3–7% for triggering cleaning interventions, balancing energy loss against maintenance costs [
35,
36]. For instance, research on utility-scale PV plants has demonstrated that an optimized cleaning threshold of approximately 5% yields the most economical outcomes when considering prevailing electricity tariffs and cleaning expenses [
35]. Furthermore, recent data-driven approaches to soiling estimation have successfully employed similar threshold values to maintain cost-effectiveness while preventing unnecessary cleaning operations [
47,
48]. The 0.5 probability threshold for classification represents the standard default in binary classification tasks and provides an unbiased decision boundary.
While a fixed threshold of SLI = 5% was used in this study for consistency and comparability, adaptive threshold optimization based on site-specific economic factors (energy tariffs, cleaning costs, degradation rates) represents an important direction for future research. This would allow the framework to dynamically adjust cleaning triggers based on real-time economic conditions rather than relying on a fixed percentage, as noted in
Section 8.4.
6. Results
Following the methodological framework established in
Section 3,
Section 4 and
Section 5, we now present the empirical results of applying this framework to the Shams Centre dataset. All metrics reported below are defined in
Section 5.3.
6.1. Cleaning Alert Agreement Between Real and Simulated Data
The proposed framework was evaluated using 17,838 records comprising real operational measurements and regression-based simulated electrical data. As summarized in
Table 10, the cleaning decisions obtained from both datasets are fully consistent, with zero differences in alert count and alert rate, indicating that the simulation framework does not modify cleaning frequency or scheduling.
The strong temporal alignment of predicted alerts demonstrates that daily and seasonal cleaning patterns are preserved, ensuring that maintenance actions occur at the same times when regression-based simulated data are used. In addition, the minimal misclassification risk confirms that false cleaning triggers and missed cleaning events are negligible. Overall, these results validate the reliability of the proposed simulation approach and its suitability for automated PV cleaning decision support.
6.2. Classification Performance
Table 11 presents the classification performance of the models. PLS, Elastic Net, LASSO, PCR, Ridge, and Robust achieved identical metrics for real and regression-based simulated datasets, with high accuracy, precision, recall, and F1-scores. Confusion matrix analysis showed minimal misclassifications and balanced false-positive and false-negative rates, confirming that the simulation process preserves the learned decision boundaries.
To visualize the discriminative ability of the classifiers across all possible threshold values, Receiver Operating Characteristic (ROC) curves were generated for both paradigms.
Figure 7 presents these curves, comparing performance between models trained on real measured data (Paradigm A) and those trained on regression-based simulated data (Paradigm B).
The AUC values further confirm the excellent discriminative ability of all models, with scores ranging from 0.989 to 0.993 across both real and regression-based simulated datasets. These values, substantially above the 0.5 random guessing baseline and well exceeding the 0.8 threshold typically considered ‘good’ performance [
53], indicate that the classifiers maintain near-perfect separation between cleaning and non-cleaning events regardless of whether features are derived from real or simulated power data. The minimal AUC differences between Paradigm A and Paradigm B (≤0.002) demonstrate that the regression-based simulation framework preserves the discriminative characteristics learned from real measurements. This aligns with the probabilistic interpretation of AUC as the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance [
54]. The ROC curves in
Figure 7 visually confirm this consistency, with all curves hugging the top-left corner and showing minimal deviation between real and simulated data across all threshold values.
6.3. Power Prediction Accuracy
The agreement between real and simulated electrical measurements is shown in
Table 12. DC power exhibited the strongest correspondence, confirming its central role in cleaning decision logic. DC current showed high correlation with slightly higher variability, while DC voltage displayed moderate correlation, reflecting its lower sensitivity to soiling. Despite these differences, classification performance remained unaffected.
6.4. Correlation Analysis by Model
A detailed correlation breakdown is provided in
Table 13. For all models, DC power and DC current exhibited strong to very strong R² values between real and regression-based simulated datasets, while voltage consistently showed moderate correlation. This pattern was observed uniformly across all regression approaches, indicating that differences in modeling strategy do not significantly alter the underlying electrical relationships captured by the simulation.
6.5. Additional Performance Metrics (MSE, AIC, BIC)
To complement correlation and classification analyses, additional metrics were computed to assess both predictive accuracy and model simplicity across all the regression approaches considered: PLS, Elastic Net, LASSO, PCR, Ridge, and Robust Regression. The metrics include Mean Squared Error (MSE), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC), which provide complementary insights into absolute prediction errors and model parsimony. Detailed mathematical formulations of these metrics are provided in
Appendix A (
Appendix A.1 and
Appendix A.2).
All metrics are summarized in
Table 14, allowing comparison of model performance in terms of prediction accuracy, simplicity, and interpretability.
Table 15 demonstrates that all regression models achieve high predictive accuracy on both real and regression-based simulated datasets. Differences between real and simulated predictions are minimal, indicating that the synthetic clean-power outputs reliably replicate operational behavior. PLSR and Elastic Net exhibit a strong balance between accuracy and model complexity, as reflected in both MSE and information criteria (AIC/BIC), making them suitable candidates for generating Stage 2 cleaning classification features.
6.6. Model-Level Performance Comparison
Building on the previous analyses, the relative strengths and limitations of each regression approach are summarized in
Table 14, considering classification performance, electrical reconstruction accuracy, robustness, and interpretability. While most models achieve comparable classification results, differences emerge in their handling of multicollinearity, sensitivity to noise, prediction stability, and suitability for automated deployment.
Consistent with the data boundaries defined in
Section 4.7, we emphasize that the simulated clean-power data used throughout these analyses are generated exclusively by Stage 1 models trained on historical clean-period measurements. These simulated outputs serve as fixed inputs to Stage 2, with no feedback or retraining occurring between stages. This independence ensures that the comparative performance summarized in
Table 14 reflects intrinsic model characteristics rather than iterative adaptation between stages.
Table 16 synthesizes the relative strengths of each regression approach. Notably, while most models achieve comparable classification outcomes, PLS and PCR demonstrate superior reconstruction fidelity, whereas Elastic Net and Ridge offer enhanced numerical stability. LASSO provides the most interpretable coefficients, and Robust Regression ensures resilience to outliers. The comparative evaluation emphasizes that model choice should balance accuracy, interpretability, and robustness, depending on operational requirements and deployment constraints.
The two-stage methodology that produced these results is summarized in
Table 17. This framework ensures specialization, modularity, interpretability, robustness, and practicality, allowing each modeling component to focus on its optimal task while enabling clear mapping from physical power prediction to actionable maintenance decisions.
The framework’s modular design supports the independent optimization and validation of each stage, while the consistent performance across multiple regression approaches demonstrated in
Table 9,
Table 10,
Table 11,
Table 12,
Table 13 and
Table 14 confirms that the architecture reliably converts power predictions into accurate cleaning schedules regardless of the specific modeling technique employed.
7. Discussion
PLS Regression (PLSR) achieved the highest fit for real data, benefiting from its supervised latent-variable projection, and maintained nearly identical performance on regression-based simulated data. PCR provided a slightly lower fit due to its unsupervised latent-variable approach but still offered reliable reconstruction of DC power.
Among the regularized models, Ridge and Elastic Net balanced accuracy and stability effectively, achieving strong fits on both real and regression-based simulated data. Ridge regression demonstrated robust handling of multicollinearity, while Elastic Net combined sparsity with numerical stability, showing consistent performance even on synthetic datasets. LASSO produced a sparse and interpretable solution with slightly higher error, reflecting a modest trade-off between sparsity and reconstruction accuracy. Robust Regression resisted the influence of outliers, maintaining comparable predictive performance across both datasets.
Evaluation metrics including RMSE, MSE, MAE, AIC, and BIC provide complementary insights. All models achieved low RMSE and MSE, indicating high predictive fidelity, while AIC and BIC values confirm that the models maintain a favorable balance between accuracy and complexity. Differences between real and regression-based simulated datasets were minor, highlighting the framework’s ability to preserve operational cleaning patterns and accurately capture PV system behavior.
Overall, the analysis confirms that the framework reliably predicts cleaning requirements across diverse regression approaches. PLS and Elastic Net offer a combination of accuracy and interpretability, PCR excels in high-fidelity reconstruction, LASSO provides sparse solutions for easier coefficient analysis, Ridge ensures stability under multicollinearity, and Robust Regression protects against outliers. These results demonstrate that the proposed framework effectively integrates environmental, temporal, and soiling-related features to generate precise, automated cleaning schedules.
7.1. Interpretation of Key Findings
The consistently high correlation between real and regression-based simulated data for DC power (R > 0.97) and current (R > 0.95) indicates that statistical models trained on historical measurements successfully capture the underlying relationships between environmental conditions and electrical output. This fidelity arises from the models’ ability to learn site-specific characteristics, including local irradiance patterns, thermal response rates, and soiling accumulation kinetics, that physics-based simulations would require extensive parameterization to replicate [
18,
20]. The supervised latent-variable approach of PLSR demonstrated superior performance in capturing the covariance structure between environmental predictors and power output, consistent with findings in [
21,
22,
23,
24,
25]. This advantage stems from PLSR’s ability to maximize covariance between predictors and target variable information during dimensionality reduction, unlike PCR which focuses solely on predictor variance [
26].
The moderate correlation observed for DC voltage (R ≈ 0.52) across all models reflects the physical reality that voltage is less sensitive to soiling than current under the environmental conditions prevalent at the Shams Centre site. This aligns with findings from [
26,
29], who reported that soiling-induced performance degradation manifests primarily through current reduction in monocrystalline silicon modules operating at moderate temperatures. The consistency of this voltage insensitivity across all regression approaches confirms that this phenomenon is physical rather than model dependent.
The near-perfect agreement between cleaning decisions derived from real and simulated data (F1-score difference < 0.001, AUC difference ≤ 0.002) demonstrates that the regression-based simulation framework preserves the discriminative characteristics learned from real measurements. This decision-level equivalence is particularly significant because it shows that errors in continuous power prediction do not propagate forward to the binary classification stage, a finding with important practical implications.
7.2. Model Performance Synthesis
The reconstruction fidelity achieved by latent-variable methods (PLSR, PCR) in this study (R² > 0.95 for power) exceeds that reported by [21] (R² ≈ 0.91) and [24] (R² ≈ 0.93) for similar PV datasets. This improvement can be attributed to the inclusion of engineered interaction terms (ΔT, G × T) that capture nonlinear thermodynamic effects often overlooked in standard regression approaches [
27,
28]. The regularized models (Ridge, Lasso, Elastic Net) demonstrated robust handling of multicollinearity, confirming their suitability for PV applications where environmental predictors are inherently correlated [
22,
23].
The classification performance (F1-score = 0.979, AUC = 0.99) substantially surpasses the results reported in recent PV cleaning decision framework studies. For example, ref. [
47] achieved F1-scores of 0.82–0.89 using threshold-based methods, while [
48] reported AUC values of 0.91–0.94 with tree-based classifiers. This advancement stems from three factors: (i) the two-stage architecture that decouples power prediction from classification, (ii) the rich feature engineering incorporating soiling dynamics (rolling statistics, rate of change, environmental interactions), and (iii) the high-quality training data from the Shams Centre facility [
50,
51,
52].
The consistency across six regression models (PLSR, PCR, Ridge, Lasso, Elastic Net, Robust) demonstrates that the framework’s effectiveness is not tied to a specific modeling approach. This robustness aligns with ensemble principles in machine learning [
43], suggesting that the engineered features and two-stage architecture, rather than any single algorithm, drive the strong performance.
7.3. Implications for PV Maintenance
The near-perfect agreement between cleaning decisions derived from real and simulated data has significant practical implications. For PV installations in data-constrained environments, such as remote desert locations or large-scale plants with limited sensor coverage, regression-based simulated data can effectively substitute for missing or unreliable measurements [
13,
14,
15]. This enables predictive maintenance where it would otherwise be infeasible, addressing a key barrier to PV optimization in developing regions [
1,
4].
From an economic perspective, the high precision (0.98) and recall (0.98) translate to tangible operational benefits. False-positive cleaning events, which incur unnecessary labor, water, and equipment costs, are minimized, while false-negative events, which would allow prolonged energy losses, are avoided. This aligns with techno-economic assessments by [
35,
36], who estimated that optimized cleaning schedules can reduce operational expenditure by 20–40% compared to fixed-interval cleaning. The framework’s ability to detect both rapid soiling events (via ΔSLI/Δt) and gradual accumulation (via rolling means) ensures timely interventions across diverse soiling scenarios [
31,
34].
The modular two-stage architecture supports integration with existing PV monitoring infrastructure and automated cleaning systems [
40,
45]. Practitioners can select regression approaches based on operational priorities (LASSO for interpretability, Ridge for stability under multicollinearity, Robust Regression for outlier-prone environments) without compromising cleaning schedule accuracy. This flexibility positions the framework as a practical tool for diverse climatic regions and system configurations [
23,
38,
46].
7.4. Comparison with Previous Studies
The findings of this study align with and extend the existing literature on PV soiling and predictive maintenance. Consistent with [
21,
22,
23,
24,
25], regularized regression methods demonstrated effective handling of multicollinearity among environmental predictors. The high reconstruction fidelity achieved by PLSR and PCR supports the findings of [
26,
27,
28,
29,
30], who reported that latent-variable projection methods effectively capture the covariance structure between meteorological variables and electrical output. The high AUC values (0.989–0.993) observed across all models exceed the performance typically reported for PV cleaning classifiers in the literature, where AUC values in the range of 0.85–0.95 are more common [
47,
48], highlighting the effectiveness of the two-stage framework and engineered features.
The near-perfect agreement between cleaning decisions derived from real and simulated datasets addresses a critical gap identified in recent reviews [
6,
15,
39]. While previous studies focused primarily on signal-level reconstruction accuracy [
8,
9,
10,
11,
12,
16,
17,
18,
19,
20], this work demonstrates that simulation-driven frameworks can achieve decision-level equivalence, maintaining operational consistency in cleaning schedules. This extends the work of [
50,
51,
52], who established the statistical properties of the Shams Centre dataset, by demonstrating that these properties do not limit generalizability to simulated data.
The inclusion of engineered features capturing environmental interactions and soiling dynamics builds upon recommendations in [
31,
32,
33,
34] for the improved detection of both rapid and gradual soiling events. The success of rolling statistics (mean SLI, standard deviation, rate of change) confirms the importance of temporal context in soiling prediction, supporting findings from [
8,
9] on the value of time-series features.
7.5. Limitations
While the framework demonstrates robust performance, several limitations warrant consideration. First, the regression-based simulated data are inherently constrained by the training data distribution; extreme environmental conditions outside the historical range may not be accurately represented [
19]. This suggests that periodic model retraining with updated data is necessary to maintain accuracy under changing climatic conditions.
Second, the site-specific nature of engineered features may limit generalizability to PV systems with different module technologies, orientations, or climatic profiles [
23,
38]. The Shams Centre facility uses monocrystalline silicon modules in a hot, arid coastal climate; validation across additional sites with different technologies (polycrystalline, thin-film) and climates (temperate, tropical, high-altitude) would strengthen confidence in the framework’s transferability.
Third, the fixed 5% SLI threshold, while empirically validated for this installation and supported by the literature [
35,
36,
41], may not be economically optimal across varying energy tariffs and cleaning costs [
40]. Future implementations could benefit from adaptive thresholding that considers real-time electricity prices, water costs, and degradation rates.
Fourth, although six regression models were evaluated, the exploration of deep learning approaches (LSTM, CNN) for power prediction remains for future work. Given the temporal nature of PV data, sequence models may capture longer-range dependencies than the current regression framework [
43,
44].
Finally, the framework’s reliance on high-quality training data means that sensor malfunctions or missing cleaning logs could degrade performance. Implementation in operational settings should include data quality monitoring and fallback procedures [
50].
8. Conclusions and Future Work
8.1. Summary of Contributions
This study presented a comprehensive framework for predicting Photovoltaic (PV) panel cleaning requirements using real operational data and simulated clean-power datasets. By leveraging advanced feature engineering, temporal and environmental interactions, and multiple regression models including PLS, Elastic Net, LASSO, PCR, Ridge, and Robust Regression, the framework demonstrated high accuracy and reliability in detecting soiling-induced performance degradation.
8.2. Key Findings
- -
Feature Engineering for Soiling Detection:
Engineered features capturing module–ambient temperature differences, irradiance–temperature interactions, and soiling accumulation trends enabled the models to identify subtle performance deviations. Time-based and rolling soiling metrics enhanced the model’s ability to detect both rapid and gradual soiling events, supporting timely cleaning interventions.
- Model Evaluation and Performance: Across 17,838 real and regression-based simulated observations, all models achieved high classification accuracy for cleaning predictions. PLS and PCR excelled in reconstructing electrical outputs with latent-variable projections, while Elastic Net balanced sparsity and numerical stability. LASSO provided interpretable sparse solutions, and Ridge and Robust Regression delivered consistent performance under multicollinearity and outlier conditions. MSE, AIC, and BIC metrics confirmed both predictive fidelity and model parsimony, complementing correlation-based assessments.
- Simulation Accuracy: The simulated clean-power datasets preserved the timing, frequency, and magnitude of cleaning alerts observed in real measurements, demonstrating the framework’s capability to support PV cleaning decision-making even when only synthetic data are available. Temporal alignment and zero differences in alert counts ensured that maintenance schedules remained operationally consistent.
- Operational Implications: The framework provides an effective automated cleaning decision support system, capable of minimizing missed-cleaning and false-cleaning events while reducing reliance on manual inspection. By integrating multiple regression approaches and a robust feature set, the methodology ensures scalability and adaptability to different PV installations and environmental conditions.
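The rolling soiling metric and 5% alert logic summarized above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the window length, and the power values are hypothetical, and the measured-versus-simulated-clean comparison stands in for the full feature set.

```python
def soiling_loss_index(p_measured, p_clean):
    """Instantaneous SLI: fractional shortfall of measured power relative
    to the simulated clean-condition baseline (floored at zero)."""
    return max(0.0, 1.0 - p_measured / p_clean)

def rolling_alerts(measured, clean, window=3, threshold=0.05):
    """Flag a cleaning alert when the rolling-mean SLI exceeds the threshold,
    smoothing out single-sample noise while still catching gradual soiling."""
    sli = [soiling_loss_index(m, c) for m, c in zip(measured, clean)]
    alerts = []
    for i in range(len(sli)):
        w = sli[max(0, i - window + 1): i + 1]
        alerts.append(sum(w) / len(w) > threshold)
    return alerts

# Hypothetical daily means: output decays from 100% to 90% of clean baseline.
alerts = rolling_alerts([100.0, 97.0, 94.0, 90.0], [100.0] * 4)
```

Comparing the alert vectors produced from real and simulated clean baselines in this way is what the zero-difference alert-count result above refers to.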
8.3. Limitations
First, the model is sensor-dependent, relying on high-resolution electrical and environmental measurements; any sensor malfunction or missing data could reduce prediction accuracy.
Second, the site-specific nature of engineered features, such as soiling indices and environmental interactions, may limit the framework’s generalizability to PV systems with different module types, orientations, or climatic conditions. It is important to emphasize that the dataset used in this study originates exclusively from an arid desert climate (Oman), characterized by high solar irradiance, elevated temperatures, low rainfall, and significant dust accumulation. While this provides an excellent testbed for evaluating soiling effects, the framework’s direct transferability to other climatic regions such as temperate zones with frequent rainfall, tropical regions with high humidity, or snowy climates cannot be assumed without validation. Different soiling mechanisms (e.g., snow coverage, bird droppings in coastal areas, pollen accumulation in agricultural regions) and natural cleaning patterns (e.g., regular rain) would likely require model retraining and potential feature adaptation to maintain predictive accuracy.
Third, although six regression models (PLS, Elastic Net, LASSO, PCR, Ridge, and Robust) were evaluated, other regression or machine learning approaches remain unexplored, and may provide additional improvements in adaptability, interpretability, or reconstruction fidelity.
Finally, accurate predictions require consistent operational logs; missing or delayed cleaning records could negatively impact model reliability in real-world deployments.
8.4. Future Work
Future research will focus on extending the framework across multiple PV sites and climatic regions, incorporating additional environmental and operational variables, and evaluating a broader set of regression and hybrid models, including physics-informed approaches. A critical next step is conducting cross-site validation studies that apply the current framework to PV systems operating in diverse climatic zones, including temperate, tropical, Mediterranean, and snowy regions, to empirically assess transferability and identify necessary adaptations.
A particularly important direction is the development of adaptive threshold optimization frameworks that dynamically adjust cleaning triggers based on site-specific economic factors. This would involve integrating real-time data on local electricity tariffs, cleaning operation costs, and module degradation rates to replace the fixed 5% threshold with an economically optimal, time-varying threshold. Such an approach would enable the framework to make maintenance decisions that directly maximize return on investment rather than simply maintaining a fixed performance loss criterion.
Furthermore, we plan to explore transfer learning techniques that would allow models pre-trained on data from well-instrumented sites (such as the Shams Centre) to be fine-tuned with limited local data from new installations. This approach could significantly reduce the data requirements for deploying the framework in new climatic regions while maintaining predictive accuracy. Efforts will also target automating feature adaptation, accounting for seasonal or degradation-related variations, integrating with automated cleaning scheduling, enhancing model interpretability, and quantifying prediction uncertainty to ensure reliable and scalable deployment in diverse operational contexts.
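One lightweight way to realize such transfer learning within the linear-model family used here is biased ridge regression, which shrinks coefficients toward a pre-trained vector rather than toward zero, so a few local samples suffice to adapt a source-site model. This is a sketch of one candidate technique, not the method of this study; the function name, penalty weight, and data are hypothetical.

```python
import numpy as np

def ridge_finetune(X, y, w_pretrained, lam=10.0):
    """Fine-tune toward a pre-trained coefficient vector by solving
        min_w ||y - X w||^2 + lam * ||w - w_pretrained||^2,
    whose closed form is (X'X + lam I) w = X'y + lam w_pretrained.
    Large lam keeps the source-site model; small lam trusts local data."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_pretrained
    return np.linalg.solve(A, b)

# Toy local dataset whose true coefficients are [2, 3].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, 3.0])
w_local = ridge_finetune(X, y, w_pretrained=np.zeros(2), lam=1e-9)
```

The single penalty weight interpolates between the pre-trained and locally refit models, giving a tunable knob for how much local data a new installation must supply.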
Addressing these limitations through the proposed future work will further enhance the framework’s robustness, generalizability, and practical utility for PV maintenance optimization across diverse operational contexts.