1. Introduction
The rapid deployment of Photovoltaic (PV) systems has created an urgent need for scalable, data-driven methods to optimize performance and maintenance strategies. Environmental soiling, the accumulation of dust, pollen, and other airborne particulates on module surfaces, is a primary cause of efficiency loss. It produces nonlinear, site-dependent reductions in electrical output that vary with climate, season, and operational conditions. Conventional maintenance approaches, such as fixed cleaning schedules or reactive inspections, are often inadequate for large-scale PV plants, as they fail to capture the dynamic nature of soiling.
To overcome these limitations, predictive maintenance frameworks have emerged that integrate high-resolution operational measurements with synthetic datasets. We distinguish between two types of synthetic data: physics-based simulations (derived from physical equations) and regression-based simulated data (generated by statistical models trained on real measurements). This study focuses exclusively on the latter. These frameworks allow the estimation of clean-condition power baselines, precise quantification of energy losses, and dynamic scheduling of cleaning interventions. When combined with statistical regression techniques, they provide robust modeling of PV performance despite multicollinearity, high-dimensional inputs, and measurement noise from environmental variables such as irradiance, module temperature, humidity, and wind speed.
Among regression approaches, Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression are particularly effective. PLSR and PCR reduce dimensionality through latent-variable projection. Ridge and Lasso employ L2 and L1 regularization to improve stability and sparsity; Elastic Net combines both penalties; and Robust Regression mitigates outlier effects caused by shading, inverter clipping, or sensor errors. By leveraging both real and simulated PV datasets, these models enable the generation of adaptive cleaning schedules that closely replicate actual performance degradation, reducing operational costs while maximizing energy yield.
This study builds on these advances by providing a comprehensive evaluation of multiple regression models for PV cleaning schedule prediction, establishing a rigorous framework for predictive maintenance in large-scale, data-constrained PV systems. The following sections review soiling-induced losses, synthetic dataset generation, and regression-based predictive modeling, forming the technical foundation for the proposed approach. The central objective of this paper is to determine how effectively cleaning schedules derived from simulated model outputs replicate those obtained from real operational data, thereby assessing the decision-level reliability and practical suitability of synthetic datasets for PV predictive maintenance.
2. Literature Review
Environmental soiling is recognized as a major determinant of PV system performance, exhibiting strong dependence on regional climatic conditions, atmospheric particulate concentration, and seasonal variability. Empirical studies report energy losses ranging from approximately 2% in temperate regions to over 50% in arid, dust-prone environments [1,2,3,4]. Conventional maintenance strategies, such as fixed-interval cleaning or reactive inspections following observed performance drops, fail to account for the stochastic and site-dependent nature of soiling, leading to either excessive operational costs or sustained energy yield reductions [5,6,7]. Consequently, predictive maintenance frameworks have been proposed that leverage either high-resolution operational measurements or synthetic datasets generated from physics-based or empirical PV models to determine optimal cleaning schedules [5,6,7,8,9,10,11,12].
Notably, while some studies employ physics-based simulation frameworks [8,9,10,11,12], the present work generates simulated datasets using statistical regression models trained on real operational data, referred to as "regression-based simulated data" throughout. The generation of high-fidelity synthetic PV datasets has become a critical tool for predictive maintenance. Advanced simulation frameworks integrate irradiance transposition, spectral mismatch correction, dynamic thermal modeling, soiling accumulation kinetics, and diode-based IV curve representations to produce realistic time-series data for DC power, current, and voltage [8,9,10,11,12]. These synthetic datasets facilitate the training of predictive algorithms, scenario-based analyses, and uncertainty quantification, particularly in environments where field measurements are sparse, incomplete, or economically prohibitive [13,14,15]. Validation studies indicate that well-calibrated simulation frameworks can reproduce 90–98% of observed PV performance metrics, demonstrating their suitability for operational decision support [16,17,18,19,20].
Statistical regression and data-driven modeling techniques have been extensively applied to capture the relationships between environmental variables and PV electrical outputs. Linear and regularized regression approaches, including Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression, are particularly effective in addressing multicollinearity, high-dimensional predictor spaces, and measurement noise [21,22,23,24,25,26,27,28,29,30]. PLSR and PCR reduce dimensionality through latent-variable projection; Ridge and Lasso employ L2 and L1 regularization to enhance parameter stability and sparsity; Elastic Net combines both penalties; and Robust Regression mitigates the influence of outliers caused by transient events such as partial shading, inverter clipping, or sensor anomalies [21,22,23,24,25,26,27,28,29,30]. When applied to synthetic datasets, these methods enable robust prediction of clean-condition electrical baselines, performance degradation trends, and soiling-induced energy losses [31,32,33,34].
Recent developments in hybrid modeling approaches, which integrate physical knowledge with data-driven techniques, have further improved the generalizability of PV performance models across different module technologies, system configurations, and climatic zones [35,36,37,38]. Moreover, advanced simulation frameworks now capture detailed electrical characteristics, including nonlinear IV behavior, temperature-dependent resistive losses, and spectral response variability, enabling more realistic synthetic datasets for predictive modeling [2,5,11,27].
Despite these advancements, comprehensive comparative evaluations of multiple regression methodologies specifically for PV cleaning schedule prediction remain scarce. Existing studies predominantly focus either on continuous electrical parameters or on binary cleaning decisions, rarely considering both simultaneously [6,15,26,39]. This limitation constrains the assessment of how accurately synthetic datasets can replicate real operational performance and support adaptive maintenance strategies. The present study addresses this gap by systematically evaluating a suite of regression approaches, including PLSR, PCR, Ridge, Lasso, Elastic Net, and Robust Regression, using both real and synthetic PV datasets. Performance is quantified through correlation coefficients (R), coefficients of determination (R²), mean absolute deviations, and binary classification metrics, including accuracy, precision, recall, and F1-score [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54]. This framework enables a rigorous comparison of regression methodologies, assessing their capability to generate reliable clean-condition baselines, estimate soiling-induced energy losses, and inform data-driven PV cleaning schedules under variable environmental and operational conditions.
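As an illustration of the evaluation protocol described above, the sketch below computes the continuous metrics (R, R², mean absolute deviation) and the binary classification metrics (accuracy, precision, recall, F1-score) with scikit-learn. All numbers are small synthetic stand-ins, not values from the Shams Centre dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             precision_score, r2_score, recall_score)

# Hypothetical measured vs. predicted clean-condition power (kW)
y_true = np.array([10.2, 12.5, 8.3, 15.0, 11.1])
y_pred = np.array([10.0, 12.9, 8.1, 14.6, 11.4])

corr = np.corrcoef(y_true, y_pred)[0, 1]   # correlation coefficient R
r2 = r2_score(y_true, y_pred)              # coefficient of determination R^2
mae = mean_absolute_error(y_true, y_pred)  # mean absolute deviation

# Hypothetical cleaning flags: real-data schedule vs. simulated-data schedule
flags_real = np.array([0, 1, 1, 0, 1, 0])
flags_sim = np.array([0, 1, 0, 0, 1, 0])
acc = accuracy_score(flags_real, flags_sim)
prec = precision_score(flags_real, flags_sim)
rec = recall_score(flags_real, flags_sim)
f1 = f1_score(flags_real, flags_sim)
```

Comparing the two flag series at the decision level, rather than only the power signals, is exactly the decision-level reliability question posed in the Introduction.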
Several recent studies have explored forecasting methods for solar generation. The study [55] presented a comprehensive analysis of predictive models for forecasting solar generation in microgrids, evaluating multiple methodologies through three error metrics and benchmarking against average base values. However, that study focused exclusively on microgrid systems, which limits its applicability to larger PV installations.
Another forecasting method is discussed in [56], which is based on an exponential smoothing approach, with a comparative evaluation of four exponential smoothing methods presented. The results demonstrate the importance of tailoring forecasting models to the specific temporal structure of PV power. In [57], the authors analyze and evaluate the performance of irradiance forecasting models, using statistical indicators to compare predicted output with actual output from experimental trials. However, humidity and wind speed were not included as predictors in that study, a limitation that the present work addresses through comprehensive environmental feature integration.
3. Methodology
This study adopts a two-stage predictive maintenance framework to rigorously evaluate whether statistically simulated Photovoltaic (PV) datasets can reliably substitute for real operational measurements in solar panel cleaning schedule prediction. The objectives of the proposed framework are to preserve physical interpretability, ensure statistical consistency, and enable decision-level comparability between real-data-driven and regression-based simulation-driven maintenance strategies.
The proposed workflow integrates two sequential and tightly coupled stages:
In Stage 1, multiple validated statistical regression models are employed to estimate the clean-condition DC power baseline of the PV system using environmental and operational inputs [50,51,52]. These models learn the functional relationship between meteorological variables (e.g., plane-of-array irradiance, module and ambient temperatures, wind speed, and humidity) and the electrical output of a continuously cleaned reference string (see Section 4.1). The resulting regression outputs represent idealized clean-condition power trajectories that are subsequently used to quantify soiling-induced performance losses. By comparing predicted clean power with measured soiled power, a Soiling Loss Index (SLI) is computed, forming a physically meaningful indicator of surface contamination and degradation.
In Stage 2, the problem is reformulated as a binary classification task, where the objective is to determine whether cleaning is required at a given time step. A supervised machine learning classifier is trained using features derived from the SLI, rolling statistical descriptors of soiling evolution, environmental interactions, and temporal variables. The classifier outputs a Cleaning Demand Flag (CDF), translating continuous power degradation estimates into actionable maintenance decisions (see Section 4.2). Importantly, this classification stage is executed under two parallel paradigms:
- (i) Features derived from real measured clean-power data;
- (ii) Features derived from regression-based simulated clean-power outputs.
This parallel evaluation enables a direct and controlled comparison of cleaning schedules generated from real versus regression-based simulated datasets.
The two-stage architecture offers several advantages. First, it decouples physical power modeling from decision-making logic, allowing each stage to be independently optimized and validated. Second, it ensures that any discrepancies between real and regression-based simulated data are evaluated not only at the signal level but also at the operational decision level, which is the ultimate concern for maintenance planning. Finally, the framework supports scalability and transferability, making it suitable for PV systems operating under data-scarce or sensor-limited conditions.
A schematic overview of the complete methodology is provided in Figure 1, illustrating the flow from data acquisition and preprocessing through regression-based power modeling, soiling loss estimation, and the machine-learning-based cleaning decision framework, as well as the comparative validation between real-data and simulation-based paradigms.
The following section describes the modeling approaches employed in both stages of the proposed framework, including the statistical regression models and training protocols used for clean-condition power prediction in Stage 1 and the machine learning–based classifiers applied in Stage 2 for soiling loss interpretation and the cleaning decision framework.
4. Dataset and Preprocessing
The dataset examined in this study was recorded at the Shams Centre, a solar energy research station operated by the German University of Technology in Oman (GUtech). Data collection was carried out using a network of sensors and monitoring points that ensures data quality and high granularity, making the dataset an excellent basis for machine learning–based predictive cleaning models.
To guide the reader through the dataset description, Table 1 provides a concise summary of the key characteristics of the Shams Centre dataset. Following this overview, Section 4.1, Section 4.2, Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7 detail the data sources, simulation framework, preprocessing steps, feature engineering, and data provenance.
4.1. Data Sources
This study relies on high-resolution, real-world operational measurements collected from the 30 kW Shams Centre Photovoltaic (PV) research facility at the German University of Technology in Oman. The dataset provides detailed electrical and environmental records obtained under actual operating conditions, forming the primary empirical foundation for model development, training, validation, and performance assessment. The measurement infrastructure, sensor specifications, data acquisition procedures, and data quality assurance protocols are described comprehensively in [50,51], ensuring the reliability and integrity of the dataset.
The 5 min resolution dataset encompasses over 235,000 multivariate time-series observations. Beyond the variables summarized in Table 1, it includes validated logs of both manual and automated cleaning activities. This precise temporal alignment between environmental conditions, soiling accumulation, and operational interventions is critical for accurate model training and evaluation.
While these high-quality measurements provide a rich and reliable foundation, real-world data can be limited by operational gaps, sensor noise, or coverage constraints. To complement the empirical dataset and enable systematic, controlled assessment of model performance, the study also incorporates simulated clean-power datasets, described in the following subsection. These simulations provide baseline clean-condition power outputs free from soiling effects, allowing direct comparison with measured data and evaluation of the predictive capability of the PV cleaning models.
4.2. Simulated Clean-Power Datasets
To complement real-world measurements and provide a controlled baseline for model development, this study employs regression-based simulations to generate clean-condition PV power outputs. It is important to clarify that these are not physics-based simulations. Rather, they are statistically generated datasets produced by regression models trained on real operational measurements from the Shams Centre facility.
The regression-based simulation framework leverages a set of carefully selected statistical models: Partial Least Squares Regression (PLSR), Principal Component Regression (PCR), Ridge Regression, Lasso Regression, Elastic Net, and Robust Regression. These models were chosen for their ability to capture linear relationships, apply regularization, and tolerate noise while handling the multicollinearity and skewed distributions characteristic of PV operational data.
For each model, the learned relationship between environmental inputs (irradiance, temperature, humidity, wind speed) and clean-power output is used to generate synthetic clean-power values under the same environmental conditions. These regression-based simulated datasets represent the expected PV system behavior under clean conditions as learned from historical data, providing a theoretical reference signal for quantifying soiling-induced losses.
The key distinction is therefore as follows:
- Real data: directly measured from sensors;
- Regression-based simulated data: generated by statistical models trained on real data;
- Physics-based simulated data: generated by physical equations (not used in this study).
These regression-based simulated baselines provide an essential point of comparison for real measurements, enabling the identification of deviations caused by soiling, environmental variability, or operational uncertainties. By removing the effects of soiling and operational noise, the simulations allow for a controlled assessment of model performance and the evaluation of predictive strategies under ideal conditions, while also enabling systematic comparison of cleaning decisions derived from real versus regression-based simulated data.
4.3. Data Cleaning and Temporal Alignment
To ensure data reliability and consistency, all datasets, both real and simulated, were synchronized to a unified timestamp format. Invalid or physically implausible values were removed, and temporal consistency was verified using diurnal voltage and power profiles. This process ensured that electrical, meteorological, and cleaning-log records were fully aligned and structurally consistent, forming a robust foundation for subsequent feature extraction and predictive modeling.
4.3.1. Initial Data Inspection and Correlation Analysis
Initial data inspection and pre-processing included timestamp alignment, correlation analysis, and numerical integrity checks. After standardizing timestamps and aligning all records to a single datetime index, additional temporal features were extracted to capture daily and operational patterns. A correlation heatmap (Figure 2) revealed strong positive relationships between irradiance, temperature, and power, alongside negative correlations with soiling variables. These insights guided feature selection and informed the design of predictive models by highlighting the primary drivers of PV system performance.
4.3.2. Temporal Standardization and Numerical Integrity Checks
Time-based features capturing diurnal and seasonal variations were derived from the unified timeline, including hour, minute, day of the week, day of the year, and a weekend flag. The accuracy of this temporal alignment was confirmed using voltage–time curves (Figure 3), which displayed the expected daily trends: a gradual morning rise, a stable midday peak, and a slow afternoon decline.
All numerical fields were standardized and checked to ensure realistic measurements. Irradiance values were confirmed to be non-negative, DC power remained non-zero during daylight hours (except at sunrise or sunset), temperatures fell within realistic climatic ranges, and voltage/current readings showed smooth transitions without abrupt spikes. No data points required removal, highlighting the robustness of the measurement infrastructure. Additional validation compared DC power distributions against irradiance patterns, confirming consistency across the unified timeline (Figure 4).
Finally, feature scaling was applied so that all variables contribute comparably to model training. The following subsection details the standardization of continuous variables and the normalization of bounded or categorical features, producing a harmonized dataset ready for robust model development.
4.4. Scaling and Normalization
With the datasets now clean, temporally aligned, and verified for numerical integrity, the next step is preparing features for modeling. This involves transforming both raw and engineered variables to ensure consistency in scale and suitability for machine learning algorithms. Proper scaling prevents variables with larger magnitudes from dominating model training, improving convergence, stability, and predictive performance.
Continuous variables, including temperatures, irradiance, DC and AC power, and other electrical outputs, were standardized using z-score normalization, centering each variable at zero with unit standard deviation. This ensures that differences in magnitude do not bias the learning process and allows models to weigh each feature appropriately.
Bounded or categorical variables, such as soiling indicators, encoded time-based features (hour of the day, day of the week, weekend flags), and operational markers, were normalized using Min–Max scaling, transforming all values to a uniform range (typically 0–1) while maintaining relative relationships.
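A minimal sketch of this two-track preprocessing with scikit-learn, using small hypothetical values in place of the actual sensor channels:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical continuous channels: module temperature (deg C), DC power (kW)
continuous = np.array([[35.0, 12.1], [41.2, 18.4], [28.9, 6.3], [44.5, 21.0]])
# Hypothetical bounded features: hour of day, weekend flag
bounded = np.array([[9.0, 0.0], [12.0, 0.0], [15.0, 1.0], [18.0, 1.0]])

z = StandardScaler().fit_transform(continuous)  # z-score: zero mean, unit variance per column
mm = MinMaxScaler().fit_transform(bounded)      # Min-Max: each column rescaled to [0, 1]

model_ready = np.hstack([z, mm])                # harmonized feature matrix
```

In practice the scalers would be fitted on the training split only and then applied to validation and test data, so that no information leaks across splits.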
Figure 5 illustrates the integration of both standardization and normalization across the dataset, combining primary sensor measurements with engineered features.
The resulting dataset is consistent, harmonized, and model-ready, providing a robust foundation for PV cleaning prediction. By standardizing and normalizing both raw and derived features, the models can effectively learn relationships between environmental conditions, electrical outputs, soiling effects, and cleaning requirements.
This fully prepared dataset now allows for a comprehensive statistical summary, which characterizes the distributions, variability, and interrelationships of all cleaned and normalized variables, providing essential insights prior to model development.
4.5. Statistical Summary of the Cleaned Dataset
With the datasets now clean, temporally aligned, and properly scaled, it is essential to quantitatively evaluate their distributions, variability, and interrelationships before proceeding to model development. This statistical overview ensures that all features, both raw and engineered, are well-characterized and suitable for learning PV cleaning patterns. It also provides insights that guide the selection and construction of additional predictive features.
A comprehensive descriptive summary of the cleaned dataset is presented in Table 2, Table 3 and Table 4. These statistics provide insight into central tendency, variability, and operating ranges across electrical, environmental, and soiling-related variables. The summary supports validation of data integrity, assessment of sensor consistency, and verification of assumptions applied during model development. The presence of realistic ranges, appropriate scaling, and physically consistent relationships indicate that the cleaned dataset is suitable for downstream feature engineering and machine learning tasks.
Correlation analysis revealed strong positive correlations between irradiance and power outputs, while soiling indicators showed negative correlations with electrical performance. Time-based features, including hour of the day, day of the week, and seasonal markers, displayed clear daily and seasonal patterns in PV behavior, validating their inclusion as engineered predictors for cleaning schedule models.
This statistical characterization establishes a quantitative foundation for feature engineering, highlighting which variables carry meaningful signals and how they relate to one another. With this understanding, the study can now proceed to the construction and optimization of features that capture temporal dependencies, environmental influences, and soiling effects, producing a dataset ready for robust predictive modeling.
4.6. Feature Engineering
To improve the model's ability to predict cleaning needs, we engineered a set of features designed to capture non-linear PV behavior, environmental interactions, and the temporal dynamics of soiling progression. These features are organized into five categories that address distinct aspects of PV performance and degradation. Table 5 summarizes these categories and their purpose before the detailed presentation in Section 4.6.1, Section 4.6.2, Section 4.6.3, Section 4.6.4 and Section 4.6.5.
4.6.1. Environmental Interaction Features
PV performance is driven by the combined, non-linear interaction of multiple environmental factors. To capture these coupled effects, we derived three interaction features based on domain knowledge:
i. Module–Ambient Temperature Difference (ΔT):
This indicates the module’s heat dissipation efficiency. Higher values correspond to reduced convective cooling and increased voltage drop, capturing thermal stress beyond absolute temperatures.
ii. Irradiance–Temperature Interaction:
This interaction term represents the competing effects of high irradiance, which boosts current, and high temperature, which reduces voltage, a relationship characteristic of desert environments.
iii. Humidity–Wind Interaction:
This combined feature captures the net effect of humidity, which promotes particle adhesion, and wind speed, which enhances cooling and can mechanically remove dust, on both soiling dynamics and thermal management.
4.6.2. Performance Normalization and Degradation Indicators
To isolate soiling effects from variations in solar resource, we normalized PV output by incident irradiance:

P_norm = P_DC / G_POA.

Under clean conditions, this ratio remains relatively stable for a given module temperature range; significant deviations indicate performance degradation attributable to soiling, module aging, or other non-radiative losses.
To further discriminate between soiling and electrical faults, we monitored changes in DC voltage and current relative to expected values. Current reductions typically signal soiling or shading, as accumulated dust primarily impedes light transmission to the cells. Voltage drops, in contrast, may indicate increased series resistance, bypass diode activation, or other electrical issues. Tracking both parameters enables the model to distinguish between different failure modes and improves the specificity of cleaning predictions.
4.6.3. Soiling Metrics and Rolling Statistics
Soiling accumulation was quantified using the Soiling Loss Index (SLI) relative to a continuously cleaned reference string:

SLI_i(t) = 1 − P_i(t) / P_clean(t),

where P_i(t) is the power of string i and P_clean(t) is the clean reference. A timestamp is labeled as requiring cleaning (CDF = 1) if SLI_i(t) exceeds a predefined threshold, indicating significant power loss.
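A minimal sketch of the SLI computation and the thresholded cleaning flag. The power values and the 5% threshold below are illustrative stand-ins, not the calibrated values used in the study:

```python
import numpy as np

def soiling_loss_index(p_string, p_clean_ref):
    """SLI = 1 - P_string / P_clean_ref, clipped to the physical range [0, 1]."""
    return np.clip(1.0 - np.asarray(p_string) / np.asarray(p_clean_ref), 0.0, 1.0)

# Hypothetical midday powers (kW) for a soiled string and the clean reference
p_clean = np.array([20.0, 20.5, 19.8, 21.0])
p_soiled = np.array([19.6, 19.2, 17.8, 18.1])
tau = 0.05  # illustrative threshold, not the paper's calibrated value

sli = soiling_loss_index(p_soiled, p_clean)
cleaning_flag = (sli > tau).astype(int)  # 1 = cleaning required at this timestamp
```

Clipping keeps the index physically interpretable when measurement noise briefly pushes the soiled string above the reference.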
To capture the temporal dynamics of soiling, we computed rolling statistics over multiple time windows. For a series x_t and a window size w, the moving average is

MA_t = (1/w) Σ_{i=0}^{w−1} x_{t−i}.
The following rolling statistics were computed for the SLI over 1 h, 3 h, and 24 h windows:
- Mean SLI: Captures short- and long-term soiling trends, smoothing transient fluctuations caused by passing clouds or sensor noise.
- Standard deviation of SLI: Quantifies variability in soiling levels, helping detect sudden dust deposition events from sandstorms or partial cleaning from light rainfall.
- Rate of change (ΔSLI/Δt): First derivative of the SLI time series, which identifies rapid soiling events requiring immediate intervention, such as those following a major dust storm.
A smoothed 24 h moving average was also applied to filter out short-term noise and highlight gradual accumulation patterns, enabling the model to distinguish between slow, continuous soiling buildup and abrupt changes requiring urgent maintenance action.
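The rolling descriptors above can be sketched with pandas at the dataset's 5 min resolution; the SLI series here is a synthetic two-day ramp with noise, standing in for real measurements:

```python
import numpy as np
import pandas as pd

# Hypothetical two-day SLI series at 5 min resolution (576 samples)
idx = pd.date_range("2024-03-01", periods=576, freq="5min")
rng = np.random.default_rng(1)
sli = pd.Series(np.clip(np.linspace(0.0, 0.08, 576) + rng.normal(0, 0.005, 576), 0, 1),
                index=idx)

windows = {"1h": 12, "3h": 36, "24h": 288}  # samples per window at 5 min steps
feats = pd.DataFrame(index=idx)
for label, w in windows.items():
    feats[f"sli_mean_{label}"] = sli.rolling(w, min_periods=1).mean()
    feats[f"sli_std_{label}"] = sli.rolling(w, min_periods=1).std()
feats["sli_rate"] = sli.diff() / (5.0 / 60.0)  # delta-SLI per hour
```

The 24 h mean doubles as the smoothed series used to separate gradual accumulation from abrupt events.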
4.6.4. Temporal and Cyclical Features
PV performance and soiling accumulation exhibit strong temporal dependencies driven by daily solar cycles, seasonal weather patterns, and maintenance history. To enable the model to learn these periodic behaviors without imposing arbitrary discontinuities, temporal variables were encoded using sine and cosine transformations. For a given cyclic variable x with period T, the transformations are

x_sin = sin(2πx/T), x_cos = cos(2πx/T).

This encoding preserves the natural continuity of cyclical features; for example, hour 23 and hour 0 are adjacent on a circle, allowing the model to recognize smooth daily transitions in irradiance and temperature. The following temporal features were encoded using this approach:
- Hour of the day (T = 24 h): Captures diurnal patterns in solar irradiance, module temperature, and soiling visibility.
- Day of the year (T = 365 days): Models seasonal variations in solar elevation, ambient temperature, and dust accumulation rates.
- Month of the year (T = 12 months): Provides a coarser seasonal indicator that complements day-of-year encoding.
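The three encodings above can be sketched with a single helper; the sample values are illustrative:

```python
import numpy as np

def cyclical_encode(x, period):
    """Project a cyclic variable onto the unit circle so period boundaries meet."""
    angle = 2.0 * np.pi * np.asarray(x, dtype=float) / period
    return np.sin(angle), np.cos(angle)

h_sin, h_cos = cyclical_encode([0, 6, 12, 23], 24)  # hour of day, T = 24
d_sin, d_cos = cyclical_encode([1, 180, 365], 365)  # day of year, T = 365
m_sin, m_cos = cyclical_encode([1, 6, 12], 12)      # month of year, T = 12

# Hour 23 and hour 0 are neighbours on the circle despite the numeric jump
gap = float(np.hypot(h_sin[3] - h_sin[0], h_cos[3] - h_cos[0]))
```

The small Euclidean gap between hour 23 and hour 0 is what lets a model treat midnight as a smooth transition rather than a discontinuity.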
4.6.5. Final Feature Set for LightGBM
In addition to cyclical encodings, we engineered a feature representing the time since the last cleaning or rainfall event. This variable captures the influence of recent maintenance or natural washing on current soiling levels: longer elapsed times generally correlate with higher soiling accumulation, while recent events temporarily restore clean conditions. This feature enables the model to account for the resetting effect of cleaning interventions and rainfall when predicting current soiling losses.
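One way to derive this elapsed-time feature with pandas is sketched below; the 6-hourly timeline and event positions are hypothetical:

```python
import pandas as pd

# Hypothetical 6-hourly timeline with two reset events (cleaning, rainfall)
idx = pd.date_range("2024-03-01", periods=10, freq="6h")
events = pd.Series(0, index=idx)
events.iloc[2] = 1  # documented cleaning event
events.iloc[7] = 1  # rainfall event

# Timestamp of the most recent reset event, carried forward in time
last_event = pd.Series(pd.NaT, index=idx)
last_event[events == 1] = last_event.index[events == 1]
last_event = last_event.ffill()

# Elapsed hours since the last cleaning/rain; NaN before any recorded event
hours_since = (pd.Series(idx, index=idx) - last_event).dt.total_seconds() / 3600.0
```

Each event resets the counter to zero, reproducing the "washing resets soiling" behavior the feature is meant to encode.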
4.7. Data Provenance and Simulation Boundaries
To ensure complete transparency in our two-stage framework, this subsection explicitly defines the provenance of all data and confirms the independence of the two modeling stages.
4.7.1. Purely Measured Data
The following data types originate exclusively from physical sensors at the Shams Centre facility and are never simulated or algorithmically modified:
- Environmental measurements: plane-of-array irradiance, module temperature, ambient temperature, wind speed, and relative humidity;
- Electrical measurements from soiled PV strings: DC power, DC current, and DC voltage (recorded continuously regardless of soiling state);
- Clean-period electrical measurements: DC power, current, and voltage recorded during confirmed clean periods (used exclusively for training);
- Operational logs: manually recorded cleaning events and automated cleaning system logs;
- Meteorological events: rainfall measurements.
These measured data serve as the immutable foundation for all model training and validation.
4.7.2. Training Data for Stage 1 Regression Models
The regression models in Stage 1 (PLSR, PCR, Ridge, Lasso, Elastic Net, Robust) are trained exclusively on a subset of purely measured data: specifically, time periods identified from cleaning logs when PV modules were confirmed clean. These clean periods are defined as the 24–48 h window immediately following a documented cleaning event, before significant soiling re-accumulation occurs.
During these clean periods, the measured DC power represents the true clean-condition baseline for that specific environmental context. This clean-period data forms the training set for all regression models. No simulated data of any kind is used during this training phase.
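A sketch of how such post-cleaning windows could be selected as the Stage 1 training set. The hourly timeline, event times, and the 36 h window duration are illustrative choices within the 24–48 h range stated above:

```python
import pandas as pd

def clean_period_mask(index, cleaning_times, window="36h"):
    """Mark timestamps inside a post-cleaning window (24-48 h in the paper;
    36 h here as an illustrative choice)."""
    mask = pd.Series(False, index=index)
    horizon = pd.Timedelta(window)
    for t in cleaning_times:
        mask |= (index >= t) & (index <= t + horizon)
    return mask

idx = pd.date_range("2024-03-01", periods=7 * 24, freq="h")  # one week, hourly
cleanings = [pd.Timestamp("2024-03-02 08:00"), pd.Timestamp("2024-03-05 08:00")]
mask = clean_period_mask(idx, cleanings)
# X_train, y_train = X[mask], y[mask]   # only clean-period rows train Stage 1
```

Rows outside the mask never enter regression training, which is what guarantees the models learn clean-condition behavior only.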
4.7.3. Regression-Derived Simulated Data
The term "regression-based simulated data" refers strictly to the output of the trained regression models when applied to any input timeframe. Mathematically, for any timestamp t with measured environmental inputs x_t, the simulated clean-condition power is

P̂_clean(t) = f(x_t),

where f represents any of the trained regression models. This value P̂_clean(t) is what we term "regression-based simulated data": it is a statistical inference of what the power would be under clean conditions, generated by a model trained on historical clean-period measurements. It is not derived from physical soiling equations and is not a direct measurement.
Crucially, these simulated values are generated only after all Stage 1 models are fully trained and fixed.
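The train-once-then-freeze procedure can be sketched as follows, with Ridge standing in for any of the six models and all data being synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
# Synthetic stand-ins: columns = irradiance, module temp, ambient temp, wind, humidity
X_all = rng.uniform([0, 15, 20, 0, 10], [1000, 60, 45, 12, 90], size=(1000, 5))
y_all = 0.02 * X_all[:, 0] + rng.normal(0, 0.3, 1000)  # toy measured DC power (kW)
clean_mask = rng.random(1000) < 0.2  # stand-in for documented clean periods

# Stage 1: fit once on clean-period measurements only, then freeze the model
model = Ridge(alpha=1.0).fit(X_all[clean_mask], y_all[clean_mask])

# Regression-based simulated data: P_hat_clean(t) = f(x_t) over the full timeline;
# nothing downstream (Stage 2) ever updates this fitted model
p_hat_clean = model.predict(X_all)
```

Because `model` is never refitted after this point, any divergence between `p_hat_clean` and measured power downstream reflects soiling or model fidelity, not iterative adaptation.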
4.7.4. Summary Table: Data Categories
Table 6 summarizes the data categories, their sources, usage, and whether they are simulated, providing a quick reference for understanding data provenance throughout the framework.
4.7.5. Independence of Stages and Absence of Feedback Loops
A critical design feature of our framework is the complete independence of the two stages. There is no feedback loop between Stage 2 and Stage 1. The regression models in Stage 1 are trained once on historical clean-period data and remain static during Stage 2 classification. The simulated clean-power outputs generated by Stage 1 serve as fixed inputs to Stage 2, but classification results do not retrain, update, or otherwise influence the regression models.
This unidirectional flow is illustrated schematically in
Figure 6. The separation ensures experimental integrity when comparing cleaning decisions derived from real versus simulated data, as any differences can be attributed solely to the fidelity of the simulated clean-power estimates rather than to iterative model adaptation.
5. Methods
5.1. Stage 1: Regression Models for Clean-Power Prediction
5.1.1. Model Selection
The regression models employed in this study fall into three complementary categories: latent-variable regression (PLSR, PCR), regularized linear regression (Ridge, Lasso, Elastic Net), and outlier-resistant regression (Robust). A complete description of each model’s characteristics and selection rationale is provided in
Table 7. The real operational dataset from the Shams Solar Site is primarily characterized by two statistical properties: pronounced multicollinearity among input features and skewed-normal distributions in both environmental and electrical variables [
50]. To reliably model PV performance under these conditions, all regression techniques were developed, evaluated, and analyzed within a unified modeling framework [
50,
51,
52]. To ensure fair comparison and predictive consistency, all models were trained using a common supervised learning protocol, identical feature sets, and standardized preprocessing and validation procedures. This unified training strategy ensures that observed performance differences arise from intrinsic model characteristics rather than data handling or training inconsistencies.
5.1.2. Training Protocol
All regression models were trained using a unified supervised learning framework to ensure consistency and fair comparison. The framework employed the input features and target variable listed in
Table 2, with >235,000 observations split chronologically into training (70%), validation (15%), and testing (15%) sets to preserve temporal dependencies.
Z-score normalization was applied to all continuous features to avoid scale dominance:
$$z_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j},$$
where the following definitions are used:
$x_{i,j}$: Original value of feature $j$ for observation $i$ (e.g., irradiance, module temperature);
$\mu_j$: Mean of feature $j$ across all observations;
$\sigma_j$: Standard deviation of feature $j$ across all observations;
$z_{i,j}$: Standardized value of feature $j$ for observation $i$.
Hyperparameter optimization was performed using 5-fold time-series cross-validation on the training set with Mean Squared Error (MSE) as the optimization metric. Iterative models were trained until the convergence tolerance was satisfied or a maximum of 1000 iterations was reached.
Table 8 provides a complete summary of the training protocol.
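The protocol above (chronological 70/15/15 split, z-score scaling, 5-fold time-series cross-validation with MSE) can be sketched as follows. Synthetic data are used, Lasso is shown as one representative of the six models, and the alpha grid is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the chronologically ordered feature matrix.
n = 1000
rng = np.random.default_rng(1)
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Chronological 70/15/15 split (no shuffling, preserving temporal dependencies).
i_tr, i_va = int(0.70 * n), int(0.85 * n)
X_tr, y_tr = X[:i_tr], y[:i_tr]
X_va, y_va = X[i_tr:i_va], y[i_tr:i_va]
X_te, y_te = X[i_va:], y[i_va:]

# Z-score normalization inside the pipeline prevents leakage across CV folds;
# 5-fold time-series CV with MSE as the criterion, capped at 1000 iterations.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=1000))
search = GridSearchCV(
    pipe,
    {"lasso__alpha": [0.001, 0.01, 0.1]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
).fit(X_tr, y_tr)
```

Placing the scaler inside the pipeline ensures that $\mu_j$ and $\sigma_j$ are recomputed on each training fold only, which is the main point of a unified preprocessing protocol.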
5.2. Stage 2: Machine Learning-Based Cleaning Classification
The cleaning decision problem was formulated as a supervised binary classification task, where the objective is to predict the Cleaning Demand Flag (CDF) from high-resolution PV operational data. Light Gradient Boosting Machine (LightGBM) was selected due to its ability to efficiently handle large, time-ordered datasets while capturing nonlinear relationships and high-order interactions among environmental conditions, soiling accumulation, and PV performance degradation. Its ensemble-based, gradient boosting architecture provides robust modeling of both gradual performance decline and infrequent cleaning events, while built-in regularization and weighted loss functions address overfitting and class imbalance.
The model was trained to predict the binary CDF using the engineered feature set summarized in
Table 9. This feature set integrates primary performance metrics (clean power, soiled power, Soiling Loss Index), temporal and cyclical variables (cyclically encoded hour, day-of-year, month, and time since last cleaning or rainfall), time-based soiling statistics (rolling means, standard deviations, and rate of change in SLI over 1 h, 3 h, and 24 h windows), environmental interaction terms (e.g., ΔT and G × T), and power indicators (voltage and current changes).
To ensure temporal consistency and prevent information leakage, the dataset was split chronologically into 70% training, 15% validation, and 15% testing subsets. Model hyperparameters were optimized using Bayesian optimization combined with 5-fold time-series cross-validation on the training data. The final optimized model employed 64 leaves, a learning rate of 0.05, a maximum tree depth of 10, a minimum of 50 samples per leaf, a feature subsampling ratio of 0.8, and L1 and L2 regularization coefficients set to 1.0. Class imbalance, where cleaning events represent a minority of observations, was addressed through cost-sensitive learning by assigning higher misclassification penalties to the minority cleaning class.
The complete cleaning decision framework was evaluated under two parallel paradigms:
Paradigm A: Features derived from real measured PV power data.
Paradigm B: Features derived from regression-based simulated clean-power outputs.
This design enables direct assessment of whether cleaning schedules generated using simulated power data remain consistent with those obtained from real measurements, while maintaining LightGBM as a common decision-making engine. Model performance was evaluated using standard classification metrics (accuracy, precision, recall, F1-score, AUC), temporal alignment via Mean Absolute Time Error (MATE), operational concordance with ground truth cleaning events, and feature importance analysis.
5.3. Evaluation Metrics
5.3.1. Soiling Loss Index and Cleaning Demand Flag
The predicted clean-power values are then used to compute the Soiling Loss Index (SLI):
$$\mathrm{SLI}(t) = \frac{\hat{P}_{\text{clean}}(t) - P_{\text{soiled}}(t)}{\hat{P}_{\text{clean}}(t)},$$
where $P_{\text{soiled}}(t)$ denotes the measured soiled DC power at the same timestamp.
The SLI provides a physically meaningful, normalized indicator of contamination severity that is independent of absolute power levels and ambient conditions.
A binary Cleaning Demand Flag (CDF) was derived by applying a fixed 5% power loss threshold to the SLI. This threshold is widely adopted in the literature [
35,
36,
41] as it represents a point where the cost of cleaning is typically offset by the value of recovered energy, while remaining above typical sensor uncertainty [
19,
50].
The CDF is therefore defined as
$$\mathrm{CDF}(t) = \begin{cases} 1, & \mathrm{SLI}(t) \geq 0.05, \\ 0, & \text{otherwise}. \end{cases}$$
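In code, the SLI and the thresholded CDF can be sketched as below. The ratio form of the SLI is assumed from its description as a normalized, level-independent loss indicator; the function names are ours.

```python
import numpy as np

def soiling_loss_index(p_clean_pred, p_soiled):
    """SLI as the fractional power loss relative to the predicted clean baseline
    (assumed form: (P_clean_pred - P_soiled) / P_clean_pred)."""
    p_clean_pred = np.asarray(p_clean_pred, dtype=float)
    p_soiled = np.asarray(p_soiled, dtype=float)
    return (p_clean_pred - p_soiled) / p_clean_pred

def cleaning_demand_flag(sli, threshold=0.05):
    """Binary CDF: 1 when the soiling loss meets or exceeds the 5% threshold."""
    return (np.asarray(sli) >= threshold).astype(int)
```

For example, a 7% loss raises the flag while a 2% loss does not, matching the fixed 5% decision rule.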
5.3.2. Regression Model Performance Metrics
Model performance in Stage 1 was assessed using the coefficient of determination (R²), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Percentage Error (MAPE). Mathematical definitions for all regression metrics are provided in
Appendix A (
Appendix A.1 and
Appendix A.2).
To evaluate the trade-off between accuracy and model complexity, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were also computed [
52,
53]. These information criteria enable direct comparison of predictive accuracy and model parsimony across different regression approaches, accounting for differences in the number of model parameters. Lower AIC and BIC values indicate a more favorable balance between goodness of fit and model simplicity. Mathematical formulations for AIC and BIC are provided in
Appendix A (
Appendix A.3).
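A common Gaussian-likelihood formulation of these criteria is sketched below; the paper's exact variant is the one given in Appendix A.3, so treat this as an illustrative assumption.

```python
import numpy as np

def aic_bic(y_true, y_pred, n_params):
    """AIC and BIC under a Gaussian likelihood:
    AIC = n * ln(RSS / n) + 2k,   BIC = n * ln(RSS / n) + k * ln(n)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rss = float(np.sum((y_true - y_pred) ** 2))
    aic = n * np.log(rss / n) + 2 * n_params
    bic = n * np.log(rss / n) + n_params * np.log(n)
    return aic, bic
```

Note that for more than about seven observations, ln(n) > 2, so BIC penalizes extra parameters more heavily than AIC, which is why the two criteria can rank models differently.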
5.3.3. Classification Performance Metrics
Classification performance was evaluated using standard metrics derived from the confusion matrix, including accuracy, precision, recall, and F1-score [
54]. A fixed probability threshold of 0.5 was applied to convert the model’s probabilistic outputs into binary cleaning decisions. Mathematical definitions for all classification metrics are provided in
Appendix A (
Appendix A.4).
In addition to these threshold-based metrics, we evaluated discriminative ability using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC). The ROC curve plots the True Positive Rate against the False Positive Rate across all possible classification thresholds, providing a comprehensive view of the model’s discriminative ability independent of threshold selection [
52,
53]. The AUC summarizes this performance as a single value between 0.5 (random guessing) and 1.0 (perfect classification), representing the probability that a randomly chosen positive instance receives a higher predicted probability than a randomly chosen negative instance [
54].
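The probabilistic interpretation of AUC described above can be verified numerically on synthetic scores (the labels and scores below are randomly generated for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # synthetic binary labels
scores = 0.3 * y + rng.random(200)      # noisy but informative predicted scores

auc = roc_auc_score(y, scores)

# AUC equals the probability that a randomly chosen positive instance
# receives a higher score than a randomly chosen negative instance.
pos, neg = scores[y == 1], scores[y == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))
```

With continuous scores (no ties), the pairwise fraction and the area under the ROC curve coincide, which is the Mann–Whitney equivalence the text invokes.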
5.3.4. Temporal Alignment Evaluation
In addition to point-wise classification accuracy, temporal consistency between predicted and actual cleaning events was evaluated using the Mean Absolute Time Error (MATE). For each predicted cleaning event, the absolute difference between the predicted time $t_i^{\text{pred}}$ and the nearest ground-truth cleaning time $t_i^{\text{true}}$ was computed:
$$\mathrm{MATE} = \frac{1}{N} \sum_{i=1}^{N} \left| t_i^{\text{pred}} - t_i^{\text{true}} \right|,$$
where $N$ denotes the number of matched cleaning events. MATE is expressed in days and reflects the practical impact of prediction errors on maintenance scheduling.
Example: If three cleaning events were predicted 1, 2, and 3 days away from the actual cleaning dates, then
$$\mathrm{MATE} = \frac{1 + 2 + 3}{3} = 2 \text{ days}.$$
Thus, on average, predicted cleaning events deviate by 2 days from actual events, which is a critical consideration for operational planning, as small timing errors can lead to unnecessary cleaning actions or extended energy losses.
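A direct implementation of MATE matching the worked example is sketched below (times are expressed as day numbers; the function name is ours, not from the paper):

```python
import numpy as np

def mean_absolute_time_error(pred_times, true_times):
    """MATE: mean absolute gap (in days) between each predicted cleaning
    event and its nearest ground-truth cleaning event."""
    true_times = np.asarray(true_times, dtype=float)
    return float(np.mean([np.min(np.abs(t - true_times))
                          for t in np.asarray(pred_times, dtype=float)]))
```

For predictions that land 1, 2, and 3 days from the nearest actual cleanings, the function returns 2.0, reproducing the worked example.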
5.3.5. Comparative Evaluation Framework
Classification performance metrics and temporal alignment measures were computed for both paradigms under identical evaluation conditions:
- Paradigm A: Features derived from real measured PV power data;
- Paradigm B: Features derived from regression-based simulated clean-power outputs (generated by statistical models trained on real measurements).
This direct comparison enables the quantification of how deviations in simulated clean-power inputs propagate through the classification stage and affect operational cleaning decisions.
Consistency in accuracy, F1-score, and MATE between the two paradigms indicates that regression-based simulated datasets can effectively replace real measured power data for automated PV cleaning prediction.
All metrics defined in this section are reported in
Section 6 (Results): classification performance in Table 11, regression model performance in Tables 13 and 14, and temporal alignment in
Section 6.1.
5.3.6. Note on Adaptive Thresholds
The choice of a 5% Soiling Loss Index (SLI) threshold in this study was guided by both the prior literature and operational practice. Studies on PV soiling have commonly adopted threshold values in the range of 3–7% for triggering cleaning interventions, balancing energy loss against maintenance costs [
35,
36]. For instance, research on utility-scale PV plants has demonstrated that an optimized cleaning threshold of approximately 5% yields the most economical outcomes when considering prevailing electricity tariffs and cleaning expenses [
35]. Furthermore, recent data-driven approaches to soiling estimation have successfully employed similar threshold values to maintain cost-effectiveness while preventing unnecessary cleaning operations [
47,
48]. The 0.5 probability threshold for classification represents the standard default in binary classification tasks and provides an unbiased decision boundary.
While a fixed threshold of SLI = 5% was used in this study for consistency and comparability, adaptive threshold optimization based on site-specific economic factors (energy tariffs, cleaning costs, degradation rates) represents an important direction for future research. This would allow the framework to dynamically adjust cleaning triggers based on real-time economic conditions rather than relying on a fixed percentage, as noted in
Section 8.4.
6. Results
Following the methodological framework established in
Section 3,
Section 4 and
Section 5, we now present the empirical results of applying this framework to the Shams Centre dataset. All metrics reported below are defined in
Section 5.3.
6.1. Cleaning Alert Agreement Between Real and Simulated Data
The proposed framework was evaluated using 17,838 records comprising real operational measurements and regression-based simulated electrical data. As summarized in
Table 10, the cleaning decisions obtained from both datasets are fully consistent, with zero differences in alert count and alert rate, indicating that the simulation framework does not modify cleaning frequency or scheduling.
The strong temporal alignment of predicted alerts demonstrates that daily and seasonal cleaning patterns are preserved, ensuring that maintenance actions occur at the same times when regression-based simulated data are used. In addition, the minimal misclassification risk confirms that false cleaning triggers and missed cleaning events are negligible. Overall, these results validate the reliability of the proposed simulation approach and its suitability for automated PV cleaning decision support.
6.2. Classification Performance
Table 11 presents the classification performance of the models. PLS, Elastic Net, LASSO, PCR, Ridge, and Robust achieved identical metrics for real and regression-based simulated datasets, with high accuracy, precision, recall, and F1-scores. Confusion matrix analysis showed minimal misclassifications and balanced false-positive and false-negative rates, confirming that the simulation process preserves the learned decision boundaries.
To visualize the discriminative ability of the classifiers across all possible threshold values, Receiver Operating Characteristic (ROC) curves were generated for both paradigms.
Figure 7 presents these curves, comparing performance between models trained on real measured data (Paradigm A) and those trained on regression-based simulated data (Paradigm B).
The AUC values further confirm the excellent discriminative ability of all models, with scores ranging from 0.989 to 0.993 across both real and regression-based simulated datasets. These values, substantially above the 0.5 random guessing baseline and well exceeding the 0.8 threshold typically considered ‘good’ performance [
53], indicate that the classifiers maintain near-perfect separation between cleaning and non-cleaning events regardless of whether features are derived from real or simulated power data. The minimal AUC differences between Paradigm A and Paradigm B (≤0.002) demonstrate that the regression-based simulation framework preserves the discriminative characteristics learned from real measurements. This aligns with the probabilistic interpretation of AUC as the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance [
54]. The ROC curves in
Figure 7 visually confirm this consistency, with all curves hugging the top-left corner and showing minimal deviation between real and simulated data across all threshold values.
6.3. Power Prediction Accuracy
The agreement between real and simulated electrical measurements is shown in
Table 12. DC power exhibited the strongest correspondence, confirming its central role in cleaning decision logic. DC current showed high correlation with slightly higher variability, while DC voltage displayed moderate correlation, reflecting its lower sensitivity to soiling. Despite these differences, classification performance remained unaffected.
6.4. Correlation Analysis by Model
A detailed correlation breakdown is provided in
Table 13. For all models, DC power and DC current exhibited strong to very strong R² values between real and regression-based simulated datasets, while voltage consistently showed moderate correlation. This pattern was observed uniformly across all regression approaches, indicating that differences in modeling strategy do not significantly alter the underlying electrical relationships captured by the simulation.
6.5. Additional Performance Metrics (MSE, AIC, BIC)
To complement correlation and classification analyses, additional metrics were computed to assess both predictive accuracy and model simplicity across all the regression approaches considered: PLS, Elastic Net, LASSO, PCR, Ridge, and Robust Regression. The metrics include Mean Squared Error (MSE), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC), which provide complementary insights into absolute prediction errors and model parsimony. Detailed mathematical formulations of these metrics are provided in
Appendix A (
Appendix A.1 and
Appendix A.2).
All metrics are summarized in
Table 14, allowing comparison of model performance in terms of prediction accuracy, simplicity, and interpretability.
Table 15 demonstrates that all regression models achieve high predictive accuracy on both real and regression-based simulated datasets. Differences between real and simulated predictions are minimal, indicating that the synthetic clean-power outputs reliably replicate operational behavior. PLSR and Elastic Net exhibit a strong balance between accuracy and model complexity, as reflected in both MSE and information criteria (AIC/BIC), making them suitable candidates for generating Stage 2 cleaning classification features.
6.6. Model-Level Performance Comparison
Building on the previous analyses, the relative strengths and limitations of each regression approach are summarized in
Table 14, considering classification performance, electrical reconstruction accuracy, robustness, and interpretability. While most models achieve comparable classification results, differences emerge in their handling of multicollinearity, sensitivity to noise, prediction stability, and suitability for automated deployment.
Consistent with the data boundaries defined in
Section 4.7, we emphasize that the simulated clean-power data used throughout these analyses are generated exclusively by Stage 1 models trained on historical clean-period measurements. These simulated outputs serve as fixed inputs to Stage 2, with no feedback or retraining occurring between stages. This independence ensures that the comparative performance summarized in
Table 14 reflects intrinsic model characteristics rather than iterative adaptation between stages.
Table 16 synthesizes the relative strengths of each regression approach. Notably, while most models achieve comparable classification outcomes, PLS and PCR demonstrate superior reconstruction fidelity, whereas Elastic Net and Ridge offer enhanced numerical stability. LASSO provides the most interpretable coefficients, and Robust Regression ensures resilience to outliers. The comparative evaluation emphasizes that model choice should balance accuracy, interpretability, and robustness, depending on operational requirements and deployment constraints.
The two-stage methodology that produced these results is summarized in
Table 17. This framework ensures specialization, modularity, interpretability, robustness, and practicality, allowing each modeling component to focus on its optimal task while enabling clear mapping from physical power prediction to actionable maintenance decisions.
The framework’s modular design supports the independent optimization and validation of each stage, while the consistent performance across multiple regression approaches demonstrated in
Table 9,
Table 10,
Table 11,
Table 12,
Table 13 and
Table 14 confirms that the architecture reliably converts power predictions into accurate cleaning schedules regardless of the specific modeling technique employed.
7. Discussion
PLS Regression (PLSR) achieved the highest fit for real data, benefiting from its supervised latent-variable projection, and maintained nearly identical performance on regression-based simulated data. PCR provided a slightly lower fit due to its unsupervised latent-variable approach but still offered reliable reconstruction of DC power.
Among the regularized models, Ridge and Elastic Net balanced accuracy and stability effectively, achieving strong fits on both real and regression-based simulated data. Ridge regression demonstrated robust handling of multicollinearity, while Elastic Net combined sparsity with numerical stability, showing consistent performance even on synthetic datasets. LASSO produced a sparse and interpretable solution with slightly higher error, reflecting a modest trade-off between sparsity and reconstruction accuracy. Robust Regression resisted the influence of outliers, maintaining comparable predictive performance across both datasets.
Evaluation metrics including RMSE, MSE, MAE, AIC, and BIC provide complementary insights. All models achieved low RMSE and MSE, indicating high predictive fidelity, while AIC and BIC values confirm that the models maintain a favorable balance between accuracy and complexity. Differences between real and regression-based simulated datasets were minor, highlighting the framework’s ability to preserve operational cleaning patterns and accurately capture PV system behavior.
Overall, the analysis confirms that the framework reliably predicts cleaning requirements across diverse regression approaches. PLS and Elastic Net offer a combination of accuracy and interpretability, PCR excels in high-fidelity reconstruction, LASSO provides sparse solutions for easier coefficient analysis, Ridge ensures stability under multicollinearity, and Robust Regression protects against outliers. These results demonstrate that the proposed framework effectively integrates environmental, temporal, and soiling-related features to generate precise, automated cleaning schedules.
7.1. Interpretation of Key Findings
The consistently high correlation between real and regression-based simulated data for DC power (R > 0.97) and current (R > 0.95) indicates that statistical models trained on historical measurements successfully capture the underlying relationships between environmental conditions and electrical output. This fidelity arises from the models’ ability to learn site-specific characteristics, including local irradiance patterns, thermal response rates, and soiling accumulation kinetics, that physics-based simulations would require extensive parameterization to replicate [
18,
20]. The supervised latent-variable approach of PLSR demonstrated superior performance in capturing the covariance structure between environmental predictors and power output, consistent with findings in [
21,
22,
23,
24,
25]. This advantage stems from PLSR’s ability to maximize covariance between predictors and target variable information during dimensionality reduction, unlike PCR which focuses solely on predictor variance [
26].
The moderate correlation observed for DC voltage (R ≈ 0.52) across all models reflects the physical reality that voltage is less sensitive to soiling than current under the environmental conditions prevalent at the Shams Centre site. This aligns with findings from [
26,
29], who reported that soiling-induced performance degradation manifests primarily through current reduction in monocrystalline silicon modules operating at moderate temperatures. The consistency of this voltage insensitivity across all regression approaches confirms that this phenomenon is physical rather than model dependent.
The near-perfect agreement between cleaning decisions derived from real and simulated data (F1-score difference < 0.001, AUC difference ≤ 0.002) demonstrates that the regression-based simulation framework preserves the discriminative characteristics learned from real measurements. This decision-level equivalence is particularly significant because it shows that errors in continuous power prediction do not propagate forward to the binary classification stage, a finding with important practical implications.
7.2. Model Performance Synthesis
The reconstruction fidelity achieved by latent-variable methods (PLSR, PCR) in this study (R² > 0.95 for power) exceeds that reported by [21] (R² ≈ 0.91) and [24] (R² ≈ 0.93) for similar PV datasets. This improvement can be attributed to the inclusion of engineered interaction terms (ΔT, G × T) that capture nonlinear thermodynamic effects often overlooked in standard regression approaches [
27,
28]. The regularized models (Ridge, Lasso, Elastic Net) demonstrated robust handling of multicollinearity, confirming their suitability for PV applications where environmental predictors are inherently correlated [
22,
23].
The classification performance (F1-score = 0.979, AUC = 0.99) substantially surpasses the results reported in recent PV cleaning decision framework studies. For example, ref. [
47] achieved F1-scores of 0.82–0.89 using threshold-based methods, while [
48] reported AUC values of 0.91–0.94 with tree-based classifiers. This advancement stems from three factors: (i) the two-stage architecture that decouples power prediction from classification, (ii) the rich feature engineering incorporating soiling dynamics (rolling statistics, rate of change, environmental interactions), and (iii) the high-quality training data from the Shams Centre facility [
50,
51,
52].
The consistency across six regression models (PLSR, PCR, Ridge, Lasso, Elastic Net, Robust) demonstrates that the framework’s effectiveness is not tied to a specific modeling approach. This robustness aligns with ensemble principles in machine learning [
43], suggesting that the engineered features and two-stage architecture, rather than any single algorithm, drive the strong performance.
7.3. Implications for PV Maintenance
The near-perfect agreement between cleaning decisions derived from real and simulated data has significant practical implications. For PV installations in data-constrained environments, such as remote desert locations or large-scale plants with limited sensor coverage, regression-based simulated data can effectively substitute for missing or unreliable measurements [
13,
14,
15]. This enables predictive maintenance where it would otherwise be infeasible, addressing a key barrier to PV optimization in developing regions [
1,
4].
From an economic perspective, the high precision (0.98) and recall (0.98) translate to tangible operational benefits. False-positive cleaning events, which incur unnecessary labor, water, and equipment costs, are minimized, while false-negative events, which would allow prolonged energy losses, are avoided. This aligns with techno-economic assessments by [
35,
36], who estimated that optimized cleaning schedules can reduce operational expenditure by 20–40% compared to fixed-interval cleaning. The framework’s ability to detect both rapid soiling events (via ΔSLI/Δt) and gradual accumulation (via rolling means) ensures timely interventions across diverse soiling scenarios [
31,
34].
The modular two-stage architecture supports integration with existing PV monitoring infrastructure and automated cleaning systems [
40,
45]. Practitioners can select regression approaches based on operational priorities (LASSO for interpretability, Ridge for stability under multicollinearity, Robust Regression for outlier-prone environments) without compromising cleaning schedule accuracy. This flexibility positions the framework as a practical tool for diverse climatic regions and system configurations [
23,
38,
46].
7.4. Comparison with Previous Studies
The findings of this study align with and extend the existing literature on PV soiling and predictive maintenance. Consistent with [
21,
22,
23,
24,
25], regularized regression methods demonstrated effective handling of multicollinearity among environmental predictors. The high reconstruction fidelity achieved by PLSR and PCR supports the findings of [
26,
27,
28,
29,
30], who reported that latent-variable projection methods effectively capture the covariance structure between meteorological variables and electrical output. The high AUC values (0.989–0.993) observed across all models exceed the performance typically reported for PV cleaning classifiers in the literature, where AUC values in the range of 0.85–0.95 are more common [
47,
48], highlighting the effectiveness of the two-stage framework and engineered features.
The near-perfect agreement between cleaning decisions derived from real and simulated datasets addresses a critical gap identified in recent reviews [
6,
15,
39]. While previous studies focused primarily on signal-level reconstruction accuracy [
8,
9,
10,
11,
12,
16,
17,
18,
19,
20], this work demonstrates that simulation-driven frameworks can achieve decision-level equivalence, maintaining operational consistency in cleaning schedules. This extends the work of [
50,
51,
52], who established the statistical properties of the Shams Centre dataset, by demonstrating that these properties do not limit generalizability to simulated data.
The inclusion of engineered features capturing environmental interactions and soiling dynamics builds upon recommendations in [
31,
32,
33,
34] for the improved detection of both rapid and gradual soiling events. The success of rolling statistics (mean SLI, standard deviation, rate of change) confirms the importance of temporal context in soiling prediction, supporting findings from [
8,
9] on the value of time-series features.
7.5. Limitations
While the framework demonstrates robust performance, several limitations warrant consideration. First, the regression-based simulated data are inherently constrained by the training data distribution; extreme environmental conditions outside the historical range may not be accurately represented [
19]. This suggests that periodic model retraining with updated data is necessary to maintain accuracy under changing climatic conditions.
Second, the site-specific nature of engineered features may limit generalizability to PV systems with different module technologies, orientations, or climatic profiles [
23,
38]. The Shams Centre facility uses monocrystalline silicon modules in a hot, arid coastal climate; validation across additional sites with different technologies (polycrystalline, thin-film) and climates (temperate, tropical, high-altitude) would strengthen confidence in the framework’s transferability.
Third, the fixed 5% SLI threshold, while empirically validated for this installation and supported by the literature [
35,
36,
41], may not be economically optimal across varying energy tariffs and cleaning costs [
40]. Future implementations could benefit from adaptive thresholding that considers real-time electricity prices, water costs, and degradation rates.
Fourth, although six regression models were evaluated, the exploration of deep learning approaches (LSTM, CNN) for power prediction remains for future work. Given the temporal nature of PV data, sequence models may capture longer-range dependencies than the current regression framework [
43,
44].
Finally, the framework’s reliance on high-quality training data means that sensor malfunctions or missing cleaning logs could degrade performance. Implementation in operational settings should include data quality monitoring and fallback procedures [
50].
8. Conclusions and Future Work
8.1. Summary of Contributions
This study presented a comprehensive framework for predicting Photovoltaic (PV) panel cleaning requirements using real operational data and simulated clean-power datasets. By leveraging advanced feature engineering, temporal and environmental interactions, and multiple regression models including PLS, Elastic Net, LASSO, PCR, Ridge, and Robust Regression, the framework demonstrated high accuracy and reliability in detecting soiling-induced performance degradation.
8.2. Key Findings
- -
Feature Engineering for Soiling Detection:
Engineered features capturing module–ambient temperature differences, irradiance–temperature interactions, and soiling accumulation trends enabled the models to identify subtle performance deviations. Time-based and rolling soiling metrics enhanced the model’s ability to detect both rapid and gradual soiling events, supporting timely cleaning interventions.
- Model Evaluation and Performance: Across 17,838 real and regression-based simulated observations, all models achieved high classification accuracy for cleaning predictions. PLS and PCR excelled in reconstructing electrical outputs with latent-variable projections, while Elastic Net balanced sparsity and numerical stability. LASSO provided interpretable sparse solutions, and Ridge and Robust Regression delivered consistent performance under multicollinearity and outlier conditions. MSE, AIC, and BIC metrics confirmed both predictive fidelity and model parsimony, complementing correlation-based assessments.
- Simulation Accuracy: The simulated clean-power datasets preserved the timing, frequency, and magnitude of cleaning alerts observed in real measurements, demonstrating the framework’s capability to support PV cleaning decision-making even when only synthetic data are available. Temporal alignment and zero differences in alert counts ensured that maintenance schedules remained operationally consistent.
- Operational Implications: The framework provides an effective automated cleaning decision support system, capable of minimizing missed-cleaning and false-cleaning events while reducing reliance on manual inspection. By integrating multiple regression approaches and a robust feature set, the methodology ensures scalability and adaptability to different PV installations and environmental conditions.
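The rolling soiling metric and 5% alert logic summarized above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the window length, and the power values are hypothetical, and the measured-versus-simulated-clean comparison stands in for the full feature set.

```python
def soiling_loss_index(p_measured, p_clean):
    """Instantaneous SLI: fractional shortfall of measured power relative
    to the simulated clean-condition baseline (floored at zero)."""
    return max(0.0, 1.0 - p_measured / p_clean)

def rolling_alerts(measured, clean, window=3, threshold=0.05):
    """Flag a cleaning alert when the rolling-mean SLI exceeds the threshold,
    smoothing out single-sample noise while still catching gradual soiling."""
    sli = [soiling_loss_index(m, c) for m, c in zip(measured, clean)]
    alerts = []
    for i in range(len(sli)):
        w = sli[max(0, i - window + 1): i + 1]
        alerts.append(sum(w) / len(w) > threshold)
    return alerts

# Hypothetical daily means: output decays from 100% to 90% of clean baseline.
alerts = rolling_alerts([100.0, 97.0, 94.0, 90.0], [100.0] * 4)
```

Comparing the alert vectors produced from real and simulated clean baselines in this way is what the zero-difference alert-count result above refers to.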
8.3. Limitations
First, the model is sensor-dependent, relying on high-resolution electrical and environmental measurements; any sensor malfunction or missing data could reduce prediction accuracy.
Second, the site-specific nature of engineered features, such as soiling indices and environmental interactions, may limit the framework’s generalizability to PV systems with different module types, orientations, or climatic conditions. It is important to emphasize that the dataset used in this study originates exclusively from an arid desert climate (Oman), characterized by high solar irradiance, elevated temperatures, low rainfall, and significant dust accumulation. While this provides an excellent testbed for evaluating soiling effects, the framework’s direct transferability to other climatic regions such as temperate zones with frequent rainfall, tropical regions with high humidity, or snowy climates cannot be assumed without validation. Different soiling mechanisms (e.g., snow coverage, bird droppings in coastal areas, pollen accumulation in agricultural regions) and natural cleaning patterns (e.g., regular rain) would likely require model retraining and potential feature adaptation to maintain predictive accuracy.
Third, although six regression models (PLS, Elastic Net, LASSO, PCR, Ridge, and Robust) were evaluated, other regression or machine learning approaches remain unexplored, and may provide additional improvements in adaptability, interpretability, or reconstruction fidelity.
Finally, accurate predictions require consistent operational logs; missing or delayed cleaning records could negatively impact model reliability in real-world deployments.
8.4. Future Work
Future research will focus on extending the framework across multiple PV sites and climatic regions, incorporating additional environmental and operational variables, and evaluating a broader set of regression and hybrid models, including physics-informed approaches. A critical next step is conducting cross-site validation studies that apply the current framework to PV systems operating in diverse climatic zones, including temperate, tropical, Mediterranean, and snowy regions, to empirically assess transferability and identify necessary adaptations.
A particularly important direction is the development of adaptive threshold optimization frameworks that dynamically adjust cleaning triggers based on site-specific economic factors. This would involve integrating real-time data on local electricity tariffs, cleaning operation costs, and module degradation rates to replace the fixed 5% threshold with an economically optimal, time-varying threshold. Such an approach would enable the framework to make maintenance decisions that directly maximize return on investment rather than simply maintaining a fixed performance loss criterion.
Furthermore, we plan to explore transfer learning techniques that would allow models pre-trained on data from well-instrumented sites (such as the Shams Centre) to be fine-tuned with limited local data from new installations. This approach could significantly reduce the data requirements for deploying the framework in new climatic regions while maintaining predictive accuracy. Efforts will also target automating feature adaptation, accounting for seasonal or degradation-related variations, integrating with automated cleaning scheduling, enhancing model interpretability, and quantifying prediction uncertainty to ensure reliable and scalable deployment in diverse operational contexts.
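One lightweight way to realize such transfer learning within the linear-model family used here is biased ridge regression, which shrinks coefficients toward a pre-trained vector rather than toward zero, so a few local samples suffice to adapt a source-site model. This is a sketch of one candidate technique, not the method of this study; the function name, penalty weight, and data are hypothetical.

```python
import numpy as np

def ridge_finetune(X, y, w_pretrained, lam=10.0):
    """Fine-tune toward a pre-trained coefficient vector by solving
        min_w ||y - X w||^2 + lam * ||w - w_pretrained||^2,
    whose closed form is (X'X + lam I) w = X'y + lam w_pretrained.
    Large lam keeps the source-site model; small lam trusts local data."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_pretrained
    return np.linalg.solve(A, b)

# Toy local dataset whose true coefficients are [2, 3].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, 3.0])
w_local = ridge_finetune(X, y, w_pretrained=np.zeros(2), lam=1e-9)
```

The single penalty weight interpolates between the pre-trained and locally refit models, giving a tunable knob for how much local data a new installation must supply.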
Addressing these limitations through the proposed future work will further enhance the framework’s robustness, generalizability, and practical utility for PV maintenance optimization across diverse operational contexts.