Improving Photovoltaic Power Prediction: Insights through Computational Modeling and Feature Selection

Abstract: This work identifies the most effective supervised machine learning models for precisely estimating the power output of photovoltaic (PV) plants. Using experimental data, the performance of ten regression models is analyzed: Random Forest regressor, Support Vector regression (SVR), Multi-layer Perceptron regressor (MLP), Linear regressor (LR), Gradient Boosting, k-Nearest Neighbors regressor (KNN), Ridge regressor (Rr), Lasso regressor (Lsr), Polynomial regressor (Plr), and XGBoost regressor (XGB). The methodology starts with meticulous data preprocessing steps to ensure dataset integrity. Following the preprocessing phase, which entails eliminating missing values and outliers using Isolation Forest, feature selection based on a correlation threshold is performed to identify relevant parameters for accurate prediction in PV systems. Subsequently, Isolation Forest is employed for outlier detection, followed by model training and evaluation using key performance metrics: Root-Mean-Squared Error (RMSE), Normalized Root-Mean-Squared Error (NRMSE), Mean Absolute Error (MAE), R-squared (R²), Integral Absolute Error (IAE), and Standard Deviation of the Difference (SDD). Among the models evaluated, Random Forest emerges as the top performer, with an RMSE of 19.413, an NRMSE of 0.048%, and an R² score of 0.968. Furthermore, the Random Forest regressor (the best-performing model) is integrated into a MATLAB application for real-time predictions, enhancing its usability and accessibility for a wide range of applications in renewable energy.


Introduction
In the global pursuit of net-zero emissions, every country has committed to vigorously advancing clean energy initiatives. Among these efforts, PV energy production stands out as a crucial and rapidly developing sustainable energy source, playing a vital role in ensuring the safe, stable, and cost-effective operation of electrical systems. However, the inherently variable nature of PV energy production, influenced by seasonal fluctuations, meteorological conditions, diurnal changes, and solar radiation intensity, presents significant challenges to the reliable integration of large-scale PV grids into the electricity system [1][2][3][4]. Accurate predictions of PV electricity production capacity are therefore essential for developing power generation plans, optimizing power dispatching, and promoting the adoption of new energy sources, ultimately reducing operational costs and enhancing system stability.
There is a strong interest in predicting and forecasting energy production in multisource systems, evaluating the power output of each component, and estimating energy generation under diverse climatic and operational conditions [5]. Various methodologies for predicting the output of photovoltaic (PV) energy systems exist, with some studies employing neural networks for energy generation prediction [6][7][8]. Different prediction models have emerged, which can be classified based on criteria such as linearity or mathematical approach [9]. These classifications divide models into linear and nonlinear categories, or into Artificial Intelligence techniques versus regressive models [10].

1. Models Based on Past Values
These models rely solely on past values as inputs, which can either be the variable to be predicted or that variable supplemented with other influential variables. These influential variables might include those relevant to the specific time they occurred and locally measured meteorological variables from those past moments. These models can be broadly categorized as described in the following subsections.

Persistence Models
A persistence model estimates the energy production of a photovoltaic system as the power production recorded at the same time on a previously measured day of operation, relying only on historical records. The main application of this prediction method is performance benchmarking or comparison with other modeling techniques [10].
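As an illustration of the persistence idea, the following minimal sketch predicts each sample as the value recorded at the same time one day earlier. The function name `persistence_forecast`, the toy power values, and the choice of four samples per day are assumptions made for brevity; with 15-minute readings, a day would hold 96 samples.

```python
import numpy as np

def persistence_forecast(power, samples_per_day):
    """Predict each sample as the value recorded at the same time on the previous day."""
    power = np.asarray(power, dtype=float)
    forecast = np.full_like(power, np.nan)           # first day has no history
    forecast[samples_per_day:] = power[:-samples_per_day]
    return forecast

# Toy series: 3 "days" of 4 samples each (hypothetical values, in kW)
power = np.array([0, 10, 12, 0, 0, 11, 13, 0, 0, 9, 14, 0], dtype=float)
forecast = persistence_forecast(power, samples_per_day=4)

# Mean absolute error over the samples that have a forecast
valid = ~np.isnan(forecast)
mae = np.mean(np.abs(power[valid] - forecast[valid]))
print(forecast, mae)
```

Because it needs no training, such a baseline is a convenient reference against which a learned model's error can be compared.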

Statistical Approaches
These PV prediction methods use time series analysis to understand the behavior of observed data series or to forecast future values. These methods are beneficial for short-term PV power production estimates. The following techniques are commonly used in statistical approaches:
Regression models: Here, PV power output is treated as a dependent variable explained by meteorological variables. These models usually require mathematical formulas and consider explanatory variables [11].
Auto-regressive models: Techniques such as ARMA (Auto-Regressive Moving Average) and ARIMA (Auto-Regressive Integrated Moving Average) are frequently used for PV prediction using time series. These techniques assume that past values of the series (the series' history) influence future values through a combination of Auto-Regressive (AR) and Moving Average (MA) elements. In a pure Auto-Regressive process, future values of the series depend solely on past values. In Moving Average processes, future values depend on mutually independent random variables that are modeled as white noise [12].

AI Techniques
These models are based on Artificial Intelligence approaches (machine learning and deep learning). Often, these methods require a large volume of data to estimate PV energy production accurately [13,14].

Hybrid Models
These models integrate physical and statistical approaches to improve the accuracy of PV power estimation by leveraging the strengths of both methods. For instance, neuro-fuzzy systems combine the supervised learning ability of neural networks with the knowledge representation of fuzzy inference systems. Such systems, commonly called Adaptive Neuro-Fuzzy Inference Systems (ANFIS), have been applied to PV power estimation [15]. Other examples of hybrid models include neural networks optimized with genetic algorithms, ARMA models combined with neural networks, the integration of various types of neural networks, and the combination of atmospheric models like MM5 for radiation prediction with fuzzy logic or neural networks for power prediction [16].

2. Physical models use detailed physical principles and environmental conditions to estimate PV energy production. These models generally require input data such as solar radiation, temperature, and other climatic factors. Standard physical models include radiative transfer, thermal, geographical information systems (GIS), and engineering models [17].
Recent efforts in predicting and forecasting PV generation have focused on various modeling approaches, including physical models, statistical analysis models, Artificial Intelligence (AI) models, and hybrid models [17][18][19].
Physical models rely on geographic and meteorological data to compute PV power, considering factors such as solar radiation, humidity, and temperature. However, modeling complexities arise from the need for detailed geographic and meteorological data specific to PV plants to anticipate production accurately.
On the other hand, statistical models capture historical time series relationships, often utilizing Auto-Regressive Moving Average (ARMA) models. Auto-Regressive Integrated Moving Average (ARIMA) models and similar techniques are known for their simplicity and computational efficiency. Yet, these models are best suited for stable time series data, whereas actual PV data exhibit high variability, which leads to significant errors [20].
The advent of smart metering technologies has provided abundant real-world data, opening new ways for machine learning and deep learning techniques to enhance data-driven algorithms for PV power generation forecasting. Moreover, integrating smart meters and data processing capabilities offers novel opportunities to improve the accuracy and reliability of PV production forecasts. By leveraging these advancements, researchers aim to develop more robust and effective prediction models capable of meeting the evolving needs of the renewable energy sector. Due to their potential for extracting representative features and data mining, AI-based models have proven to be more successful than physical and statistical ones [21].
In recent years, conventional machine learning algorithms have emerged as powerful tools for forecasting PV power generation. Demand response, proactive maintenance, energy production, and load prediction are just a few applications where machine learning models are the go-to toolkit for researchers [22]. These models can capture complex nonlinear relationships between the various factors influencing power generation and accurately predict future values [23]. Deep learning, nevertheless, can be useful when dealing with time series data.
Auto-Regressive Integrated Moving Average (ARIMA) methods are adequate for the instantaneous forecasting of robust time series data. However, artificial neural networks (ANNs) are significantly more potent than ARIMA models and traditional quantitative approaches, especially for modeling complex interactions [24]. Due to their ability to handle nonlinear models, ANNs have become increasingly popular for forecasting time series data in recent years [25].
This study contributes significantly to the field by advancing predictive modeling techniques for the renewable energy sector and providing valuable insights for optimizing PV systems and their management. Key contributions include the use of Pearson and Spearman correlation analyses to identify influential environmental variables, which enhances model interpretability and performance. Integration of Isolation Forest for outlier detection during data preprocessing ensures the removal of anomalies, thereby improving the model's generalization ability and preventing overfitting. Furthermore, the adoption of Randomized Search CV streamlines hyperparameter tuning, with Random Forest emerging as the optimal model choice due to its ensemble nature and capability to capture nonlinear relationships, which are crucial for modeling the complex dynamics of PV system generation. Additionally, the integration of Python-trained (version 3.8.0) models into a MATLAB 2023 interface represents a significant advancement in accurately predicting key parameters such as PV generation, PDC, VDC, and IDC. Moreover, this interface extends beyond mere prediction by incorporating calculations for evaluating yield, losses, and performance ratios (PR), enabling a comprehensive assessment of system performance and health. This thorough analysis capability offers valuable insights for optimizing efficiency and addressing potential issues in PV systems.
The paper is structured as follows: Section 2 introduces the PV dataset used in this study, outlining the various environmental variables and parameters pertinent to PV systems. This section also describes data preprocessing techniques, detailing the strategies for refining sensor data and emphasizing the importance of cleaning and normalization for ensuring data accuracy and reliability. The use of Pearson and Spearman correlation analyses to identify significant environmental variables for predictive modeling is also detailed in this section. The approach to enhancing regression model performance through outlier detection using Isolation Forest during data preprocessing is also discussed. The methodology provides an overview of the regression models employed for predicting key parameters of PV systems while outlining our hyperparameter tuning process using Randomized Search CV, and the evaluation metrics utilized to optimize model performance are also analyzed in Section 2. The development of a MATLAB application for power prediction, highlighting the integration of Python-trained models and the interface's capabilities for accurate prediction and system performance evaluation, is also presented in Section 2. Section 3 presents the results and their implications for the renewable energy sector and suggests potential avenues for future research. Finally, Section 4 focuses on the discussion of the main results obtained.

Materials and Methods
The data collected in this study are from a grid-connected, ground-mounted PV system in Ain El-Melh, located in the Algerian highlands and serving as the gateway to the vast desert. The site's coordinates are 34    The PV modules are linked to 500 kW inverter cabinets via junction boxes, serving as the primary data source. Data gathering occurred from 1 January 2020 to 31 December 2021, with readings taken every fifteen minutes, resulting in 69,195 data points. This dataset encompasses parameters such as solar panel temperature, tilt radiation, total radiation, dispersion radiation, direct radiation, wind speed, humidity, pressure, voltage, current, and PV power.
Table 2 shows an overview of the environmental and electrical parameters of the PV system. Data preprocessing is essential when working with actual data collected from automatic sensors, as these data often contain errors and inconsistencies. Cleaning and organizing techniques are applied to prepare the data for use with machine learning models. The focus is on correcting minor inconsistencies and removing erroneous or missing data from the monitoring dataset.
One challenge encountered is the presence of empty records, particularly during nighttime (between 9 p.m. and 4 a.m.) when no measurements are collected. While solar irradiation is naturally zero at night, air temperature data may still be missing. However, the absence of nighttime temperature data is irrelevant since there is no PV power production. Including nighttime data would only add redundant information, increasing the model complexity and calculation time without yielding meaningful results. To prevent the negative impact of empty records on learning models, rows containing null data are eliminated. The same procedure is applied to remove duplicated values or incomplete records.
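The cleaning step described above can be sketched with pandas as follows. The column names (`timestamp`, `G_total`, `T_panel`, `P_DC`) and the sample values are hypothetical, since the paper's actual column headers are not given here; only the two operations (dropping rows with nulls, dropping duplicates) come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values, for illustration only.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 02:00", "2020-01-01 10:00", "2020-01-01 10:00",
        "2020-01-01 10:15", "2020-01-01 10:30",
    ]),
    "G_total": [np.nan, 650.0, 650.0, 670.0, 690.0],
    "T_panel": [np.nan, 31.2, 31.2, np.nan, 33.0],
    "P_DC": [np.nan, 310.0, 310.0, 322.0, 335.0],
})

clean = (
    raw.dropna()             # remove rows with any missing value (e.g. night-time records)
       .drop_duplicates()    # remove exact duplicate rows
       .reset_index(drop=True)
)
print(len(raw), "->", len(clean))
```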
After these preprocessing steps, the database ultimately contains 33,465 samples. The min-max normalization method is then applied to optimize the model's performance and ensure data homogeneity. This process scales each data point to a range between 0 and 1. The equation for calculating the normalized value $x_{norm}$ for a given value $x$ is

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{1}$$

where $x_{min}$ and $x_{max}$ are the minimum and maximum values of the variable. This normalization technique serves various purposes, including speeding up the optimization process, minimizing disparities between data values, removing dimensional influences, and reducing computational requirements.
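A minimal sketch of min-max scaling, which maps a value x to (x − min) / (max − min), is given below. The helper name `min_max_normalize` and the sample values are illustrative assumptions; scikit-learn's `MinMaxScaler` performs the same operation column by column.

```python
import numpy as np

def min_max_normalize(x, x_min=None, x_max=None):
    """Scale values to [0, 1]; min and max default to those of the data itself."""
    x = np.asarray(x, dtype=float)
    x_min = x.min() if x_min is None else x_min
    x_max = x.max() if x_max is None else x_max
    # Note: a constant column (x_max == x_min) would divide by zero
    return (x - x_min) / (x_max - x_min)

temps = np.array([10.0, 25.0, 40.0])   # hypothetical panel temperatures
normalized = min_max_normalize(temps)
print(normalized)
```

Passing `x_min` and `x_max` explicitly allows new inputs to be scaled with the same bounds that were used when the model was trained.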
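The analysis examined correlation factors to ascertain the relationships between PDC and individual weather factors. The correlation coefficient, denoted as r, indicates the degree of association between two variables, $x_i$ and $y_i$, and is expressed as follows [26][27][28]:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2}$$

with the sample means

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{3}$$

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{4}$$

By applying Equations (3) and (4) to Equation (2), the following equation can be derived:

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{5}$$

where $\{\bar{x}, \bar{y}\}$ and n are the sample means and the sample size, respectively, and $\{x_i, y_i\}$ are the individual sample points indexed by i. Two methods are used for estimating the correlation coefficient between two variables: Pearson and Spearman. The Pearson method assesses the linear relationship between variables, indicating a proportional change between them. Conversely, the Spearman method evaluates a monotonic (ordinal or rank) relationship, where variables tend to change together without necessarily being proportional.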
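The difference between the two coefficients can be illustrated with a short sketch. It computes the Pearson coefficient from raw sums and obtains the Spearman coefficient as the Pearson coefficient of the ranks; the function names and the toy data (a monotonic but nonlinear relationship) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson coefficient computed from raw sums of x, y, x*y, x^2 and y^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = (np.sqrt(n * np.sum(x**2) - np.sum(x)**2)
           * np.sqrt(n * np.sum(y**2) - np.sum(y)**2))
    return num / den

def spearman_rho(x, y):
    """Spearman coefficient: Pearson applied to the ranks (average ranks for ties)."""
    return pearson_r(pd.Series(x).rank(), pd.Series(y).rank())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                     # monotonic but not linear
r = pearson_r(x, y)            # below 1: the relationship is not proportional
rho = spearman_rho(x, y)       # 1: the ordering is perfectly preserved
print(r, rho)
```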
This study employed the Pearson correlation method to analyze the relationship between PDC and environmental variables. Figure 1 illustrates the outcomes of this correlation analysis in a heatmap, with the histograms in the diagonal plots showing the frequency distributions of PDC and the environmental data.
The correlation matrix offers insights into the relationships between PV power generation, voltage, current, and environmental variables. Each cell in the matrix presents the correlation coefficient between two variables, ranging from −1 to 1. The sign of the coefficient indicates the direction of the relationship: "+" denotes a positive correlation, and "−" represents a negative correlation. A higher absolute correlation coefficient value signifies a stronger association between the variables [29,30].
Several noteworthy patterns were observed when analyzing the correlations. Tilt solar radiation (Gdin), total irradiance (Gtotal), and direct solar radiation (Gdirect) exhibit strong positive correlations with PV power generation PDC, indicating that higher values of these environmental factors tend to coincide with increased PV power generation. Conversely, humidity (H) demonstrates a notable negative correlation with PV power generation, suggesting that higher humidity levels may lead to decreased PV power output. Additionally, some variables, such as the temperature of the PV panel (Tp) and dispersed solar radiation (Gdisp), show moderate positive correlations with PV power generation. These correlations imply that temperature and dispersed solar radiation may also significantly influence PV power generation, albeit to a lesser extent than other factors like direct solar radiation (Gdirect).
Moreover, wind speed (V_V) and pressure (P) exhibit weaker correlations with PV power generation, as indicated by their correlation coefficients close to zero. While these variables may still influence PV power generation, their impact appears to be relatively minor compared to other environmental factors.
Overall, this correlation analysis provides valuable insights into how various environmental variables relate to PV power generation. Understanding these relationships can inform decision-making processes for optimizing PV system performance, forecasting energy production, and designing more efficient renewable energy systems.
The target variable PDC is defined after loading the dataset, removing any rows with missing values, and eliminating the outliers. Then, Pearson and Spearman correlation coefficients with the target variable are computed separately. The correlation coefficients from both methods are combined by selecting the maximum absolute value. After that, the features whose absolute correlation coefficients with the target variable are less than or equal to 0.1 are filtered out. This process selects a subset of the original features that meet the correlation criterion. The number of input features remains the same; we do not remove any features from the dataset itself but identify which features are relevant based on the correlation threshold. This approach ensures that significant correlations are captured regardless of the method used. Figure 2 demonstrates feature selection based on the correlation threshold for PDC data, identifying pertinent features crucial for accurate prediction and analysis in PV systems.
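The selection rule described above (maximum of the absolute Pearson and Spearman coefficients, kept if above 0.1) might be sketched as follows. The function name `select_by_correlation`, the column names, and the small deterministic dataset are hypothetical; the Spearman coefficient is obtained here as the Pearson coefficient of the ranks.

```python
import pandas as pd

def select_by_correlation(df, target, threshold=0.1):
    """Return features whose max(|Pearson|, |Spearman|) with the target exceeds threshold."""
    features = df.drop(columns=[target])
    pearson = features.corrwith(df[target]).abs()
    # Spearman coefficient = Pearson coefficient of the ranks
    spearman = features.rank().corrwith(df[target].rank()).abs()
    combined = pd.concat([pearson, spearman], axis=1).max(axis=1)
    return combined[combined > threshold].sort_values(ascending=False)

# Hypothetical toy data: irradiance, humidity, pressure and DC power
df = pd.DataFrame({
    "G_total": [1, 2, 3, 4, 5, 6, 7, 8],
    "H": [90, 85, 88, 70, 72, 60, 55, 50],
    "P": [1, 2, 2, 1, 1, 2, 2, 1],          # unrelated to the target by construction
    "P_DC": [1, 4, 9, 16, 25, 36, 49, 64],
})
selected = select_by_correlation(df, target="P_DC", threshold=0.1)
print(selected)
```

In this toy case the irradiance and humidity columns pass the threshold, while the pressure column, whose coefficients are close to zero, does not.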
Energies 2024, 17, x FOR PEER REVIEW 7 of 23
Isolation Forest is a popular algorithm used for outlier detection in machine learning. It isolates anomalies in the dataset rather than modeling the normal data points. This approach is particularly effective for high-dimensional datasets with complex structures. The main principle behind Isolation Forest is that anomalies are typically rare and have attributes that make them easy to isolate. The algorithm exploits this principle by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. This process is repeated recursively until all data points are isolated or a predefined maximum tree depth is reached. During the isolation process, anomalies are expected to be isolated with fewer splits than normal data points. Therefore, the path length to isolate an anomaly is typically shorter than that of a normal data point. By measuring the average path length across multiple isolation trees, Isolation Forest assigns anomaly scores to each data point. Data points with shorter average path lengths are considered more anomalous.
This work uses an Isolation Forest for outlier detection before training the regression models. Specifically, after loading and preprocessing the dataset, Isolation Forest is applied to detect and remove outliers using the IsolationForest class from the scikit-learn (sklearn) library. After setting the contamination parameter, which represents the expected proportion of outliers in the dataset, the outlier predictions are used to filter out the outliers from the original dataset, resulting in a cleaned dataset containing only the inlier data points. In Figure 3, we present the distributions of PDC before and after removing outliers. The X-axis represents PV generation values, while the Y-axis represents the frequency of occurrence. By comparing the two distributions, we gain insights into how removing outliers affects the overall distribution of PV generation values.
Figure 3 allows us to compare the distribution of the target variable PDC before and after removing outliers, providing insights into the impact of outlier removal on the data distribution. By identifying and removing outliers, Isolation Forest effectively isolates anomalous data points that may skew the distribution of the variables. By taking out the data points classified as outliers, Isolation Forest helps ensure that the resulting histograms accurately represent the distribution of normal data points within each bin. This enables a clearer understanding of how the data are distributed and how removing outliers affects the overall data distribution [31].
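The outlier-removal step can be sketched with scikit-learn's `IsolationForest` as follows. The synthetic irradiance/power data, the injected anomalies, and the contamination value of 0.02 are assumptions for illustration only; the contamination value used in the study is not stated in this section.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: power roughly proportional to irradiance (hypothetical values)
g = rng.uniform(100, 1000, 300)
p = 0.4 * g + rng.normal(0, 5, 300)
X = np.column_stack([g, p])
X[:5, 1] = [5000.0, -3000.0, 4500.0, -2500.0, 6000.0]   # injected anomalies

# contamination = expected proportion of outliers (assumed value)
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)        # +1 for inliers, -1 for outliers

X_clean = X[labels == 1]           # keep only the inlier rows
print(len(X), "->", len(X_clean))
```

The choice of contamination directly controls how many points are discarded, so it is worth checking the resulting distribution (as in Figure 3) rather than relying on a default.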
The methodology employed in this study began with thorough data preprocessing steps to ensure the integrity of the dataset. Missing values were addressed through either imputation or removal, and relevant features highly correlated with the target variable were selected. Correlated features were identified using Pearson and Spearman correlation coefficients, and then both sets of correlated features were merged. Additionally, outlier detection and removal were performed using Isolation Forest to enhance the robustness of the models. Subsequently, the data were split into training and testing sets for model evaluation. Following data preprocessing, ten regression models were selected for evaluation: k-Nearest Neighbors (KNN) [32], Support Vector regression (SVR) [33], Random Forest [34], Multi-layer Perceptron (MLP), Linear regressor (LR), Gradient Boosting [35], Ridge regressor (Rr) [36], Lasso regressor (Lsr) [37], Polynomial regressor (Plr) [38], and XGBoost regressor (XGB) [39]. Each model was subjected to hyperparameter tuning using Randomized Search CV, which involved optimizing hyperparameters such as the number of estimators, maximum depth, learning rate, kernel type, activation function, and number of neighbors.
The model evaluation results were compared to identify the best-performing model for PV generation prediction.This analysis helped highlight the strengths and weaknesses of each model and facilitated the selection of the most suitable model.
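For reference, four of the metrics used in the comparison (RMSE, NRMSE, MAE, and R²) can be computed as in the sketch below. The normalization of NRMSE by the range of the observed values is an assumption, since the normalization convention is not stated in this section; the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, range-normalized RMSE (assumed convention), MAE and R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err**2))
    ss_res = np.sum(err**2)
    ss_tot = np.sum((y_true - y_true.mean())**2)
    return {
        "RMSE": rmse,
        "NRMSE": rmse / (y_true.max() - y_true.min()),  # assumption: divide by range
        "MAE": np.mean(np.abs(err)),
        "R2": 1.0 - ss_res / ss_tot,
    }

y_true = [100.0, 200.0, 300.0, 400.0]   # hypothetical measured power
y_pred = [110.0, 190.0, 310.0, 390.0]   # hypothetical predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```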
After identifying and selecting the best prediction model based on its performance metrics (Random Forest in our work), the next step involves integrating this model into a MATLAB application. This process typically entails exporting the model and any necessary preprocessing steps or feature engineering techniques, especially the normalization process, into a format compatible with MATLAB. Once integrated, the model can be deployed within the MATLAB application after being converted into a desktop application using MATLAB App Designer. This allows users to input relevant data and receive predictions or insights based on the model's calculations. This integration facilitates real-time or on-demand predictions within the MATLAB environment, enhancing the usability and accessibility of the predictive model for various applications and users.
The methodology framework illustrated in Figure 4 guides the approach used. Data collection and preprocessing are carried out first, including database exploration and normalization, followed by data segmentation into training and testing sets. In the modeling phase, the objective is to train the chosen algorithms using the training data until a satisfactory model is obtained. To achieve this goal, a Randomized Search algorithm is applied to identify the best hyperparameters for the best-performing model. Finally, the last stage entails evaluating the models using testing data and calculating estimation errors. Then, the best model, along with its scaler, is saved. Additionally, k-fold cross-validation is incorporated in the training process with a fold size of 5 to enhance the robustness of the evaluations.
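A minimal sketch of the split, fit, and save steps is given below. The 80/20 split, the use of `joblib` as the storage format, the file names, the fitting of the scaler on the training set only, and the synthetic data are all assumptions; the paper does not specify these details in this section.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 3))                  # hypothetical features
y = 0.4 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 5, 200)

# Assumed 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data and reuse it for any later input
scaler = MinMaxScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_train), y_train)

# Save the model together with its scaler (hypothetical file names)
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "rf_model.joblib")
scaler_path = os.path.join(out_dir, "scaler.joblib")
joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

# Reload both and predict, as a consuming application would
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)
pred = loaded_model.predict(loaded_scaler.transform(X_test))
print(pred[:3])
```

Saving the scaler alongside the model matters because inputs supplied at prediction time must be normalized with the same bounds as the training data.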
Then, the best model, along with its scaler, is saved.Additionally, k-fold cross-validation is incorporated in the training process with a fold size of 5 to enhance the robustness of the evaluations.Hyperparameters are parameters set before learning begins, influencing a model's performance.Their adjustability directly impacts model effectiveness.Finding optimal hyperparameters involves trying various combinations.Over time, several approaches, like Grid Search and Random Search, have emerged for hyperparameter optimization.Grid Search, a traditional method, systematically explores a subset of the hyperparameter space through a complete search like the one used in previous work [43,44].It is evaluated using various performance metrics, commonly employing cross-validation on the training data.The Random Search Algorithm, known as the Monte Carlo method or stochastic algorithm [45], operates by iteratively sampling parameter settings from a Hyperparameters are parameters set before learning begins, influencing a model's performance.Their adjustability directly impacts model effectiveness.Finding optimal hyperparameters involves trying various combinations.Over time, several approaches, like Grid Search and Random Search, have emerged for hyperparameter optimization.Grid Search, a traditional method, systematically explores a subset of the hyperparameter space through a complete search like the one used in previous work [43,44].It is evaluated using various performance metrics, commonly employing cross-validation on the training data.The Random Search Algorithm, known as the Monte Carlo method or stochastic algorithm [45], operates by iteratively sampling parameter settings from a specified distribution [46], evaluating the model using cross-validation.In contrast to Grid Search, Random Search does not test all parameter values but samples several settings.Random Search performs more efficiently than Grid Search, as it avoids allocating excessive trials for less 
important dimensions to optimize the hyperparameters for all used models [47].This research employs hyperparameter tuning through a Randomized Search algorithm.The Randomized Search CV function from the sci-kit-learn library is implemented for this purpose [48].The Randomized Search CV function randomly selects hyperparameters and evaluates the results.Evaluation is conducted using cross-validation, where the data is divided into two subsets: the learning process data and the validation data.Thus, this study utilizes 5-fold cross-validation to obtain a robust model.The fundamental concept of cross-validation is to split data into two or more subsets, with one subset used to train the model and the other used for testing the model's accuracy.K-fold cross-validation is the most typical kind of cross-validation.The data are randomly partitioned into k-equal subgroups, or folds, for k-fold cross-validation.The model is tested on the last fold after being tested on k-1 folds.This process is repeated k times so that each fold is used as a testing set once.The results from each fold are then averaged to produce an overall performance estimate.Figure 5 presents the process of using the cross-validation technique with 5-fold cross-validation.
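The tuning procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the feature meanings, the choice of `StandardScaler`, the parameter ranges, `n_iter=5`, and the file name are all assumptions made for the example. Only the use of `RandomizedSearchCV` with 5-fold cross-validation, and saving the best model together with its scaler, come from the text.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the plant data (assumed features):
# irradiance, module temperature, ambient temperature.
X = rng.uniform([0.0, 10.0, 5.0], [1000.0, 60.0, 40.0], size=(300, 3))
y = 0.2 * X[:, 0] - 0.5 * (X[:, 1] - 25.0) + rng.normal(0.0, 5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Keeping the scaler and the regressor in one pipeline means the scaler is
# refit inside every fold and is saved together with the model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # assumed scaler, the paper only says "normalization"
    ("rf", RandomForestRegressor(random_state=0)),
])

# Hypothetical search space: each entry is a distribution to sample from.
param_distributions = {
    "rf__n_estimators": randint(20, 60),
    "rf__max_depth": randint(3, 12),
    "rf__min_samples_split": randint(2, 10),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,   # number of sampled settings (assumed)
    cv=5,       # 5-fold cross-validation, as in the text
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_rmse = -search.best_score_  # mean RMSE over the 5 validation folds
test_r2 = search.best_estimator_.score(X_test, y_test)

# Save the refit best pipeline (scaler + model) for later use.
model_path = os.path.join(tempfile.mkdtemp(), "best_pv_model.joblib")
joblib.dump(search.best_estimator_, model_path)

print(best_params, round(cv_rmse, 2), round(test_r2, 3))
```

Each of the five sampled settings is trained and validated five times, so the search performs 25 fits plus one final refit on the full training set.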
In predictive analytics for power systems, combining Python-based machine learning models with MATLAB's application framework (App Designer) offers gains in efficiency and accuracy. The central focus of this work is the design and implementation of a user-friendly MATLAB application tailored for power prediction tasks. The application interface is created using MATLAB's App Designer tool, allowing easy interaction and seamless integration with the underlying algorithms. The application enables users to input relevant data, select prediction parameters, and visualize measured and predicted results in real time.
Key features of the developed application include the following:
Tab-Based Interface: The application is organized into tabs corresponding to different prediction tasks, such as predicting the power demand, voltage, and current.
Interactive Controls: Users can interact with various components such as buttons, state buttons, and toggle buttons to initiate prediction tasks and customize parameters.
Visualization Tools: Graphical representations, including UI Axes components, facilitate the visualization of measured and predicted data, aiding in the analysis and interpretation of results.
Export Functionality: The application allows users to export prediction results for further analysis or integration with external systems.
Streamlined Integration: The Excel file generated by the PV station can be used directly for real-time prediction without any preprocessing.
The designed MATLAB application offers a user-friendly interface with distinct tabs for the various prediction types, namely PV power generation (PDC), PV voltage generation (VDC), PV current generation (IDC), yield, and loss calculations, as shown in Figure 6. Each tab features clear visualization through UI Axes and streamlined operations, with buttons for tasks such as clearing data, triggering predictions, and exporting results.
Additionally, in the yield tab and loss tab, the following calculations are performed:
Reference Yield: $Y_r$ for measured (actual) values and $Y_R$ for predicted values. It is the time during which the sun must shine at $G_0 = 1$ kW/m² to deliver the energy $H_t$ to the PV array.
$$\text{Reference Yield} = \frac{H_t}{G_0} \tag{12}$$
Array Yield: $Y_a$ for measured (actual) values and $Y_A$ for predicted values. It is the time the PV system needs to operate at the nominal power of the PV array $P_0$ to produce the output DC energy $E_{DC}$.
$$\text{Array Yield} = \frac{E_{DC}}{P_0} \tag{13}$$
Final Yield: $Y_f$ for measured (actual) values and $Y_F$ for predicted values. It is the time the PV system needs to operate at the nominal power of the PV array $P_0$ to produce the output AC energy $E_{AC}$.
$$\text{Final Yield} = \frac{E_{AC}}{P_0} \tag{14}$$
System Losses: $L_s$ for measured (actual) values and $L_S$ for predicted values.
$$\text{System Losses} = \text{Array Yield} - \text{Final Yield} \tag{15}$$
Array Capture Losses: $L_c$ for measured (actual) values and $L_C$ for predicted values.
$$\text{Array Capture Losses} = \text{Reference Yield} - \text{Array Yield} \tag{16}$$
Performance Ratio: PR for measured or actual, and PR for predicted.
$$\text{Performance Ratio} = \frac{\text{Final Yield}}{\text{Reference Yield}} \times 100 \tag{17}$$
The performance ratio represents the ratio between the effective energy $E_{AC}$ and the energy that would be generated by an ideal, lossless PV installation at a solar cell temperature of 25 °C and the same radiation level. Figure 7 shows the evaluation of the performance ratio and losses.
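As a worked illustration of Equations (12) to (17), the short sketch below evaluates the yields, losses, and performance ratio for one set of inputs. The function name, the `YieldReport` container, and the numerical inputs are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the plant or code from the MATLAB application.

```python
from dataclasses import dataclass

G0_KW_PER_M2 = 1.0  # reference irradiance G0 = 1 kW/m^2


@dataclass
class YieldReport:
    reference_yield: float    # Eq. (12), hours
    array_yield: float        # Eq. (13), hours
    final_yield: float        # Eq. (14), hours
    system_losses: float      # Eq. (15), hours
    capture_losses: float     # Eq. (16), hours
    performance_ratio: float  # Eq. (17), percent


def compute_yields(h_t_kwh_m2, e_dc_kwh, e_ac_kwh, p0_kwp, g0=G0_KW_PER_M2):
    """Evaluate Eqs. (12)-(17) for one period (for example, one day)."""
    y_r = h_t_kwh_m2 / g0   # reference yield
    y_a = e_dc_kwh / p0_kwp  # array yield
    y_f = e_ac_kwh / p0_kwp  # final yield
    return YieldReport(
        reference_yield=y_r,
        array_yield=y_a,
        final_yield=y_f,
        system_losses=y_a - y_f,
        capture_losses=y_r - y_a,
        performance_ratio=100.0 * y_f / y_r,
    )


# Hypothetical daily figures for a 20 MWp array (illustrative values only).
report = compute_yields(
    h_t_kwh_m2=6.0,       # in-plane irradiation Ht
    e_dc_kwh=100_000.0,   # DC energy E_DC
    e_ac_kwh=96_000.0,    # AC energy E_AC
    p0_kwp=20_000.0,      # nominal array power P0
)
print(report)
```

With these inputs the reference yield is 6 h, the array yield 5 h, and the final yield 4.8 h, so the capture losses are 1 h, the system losses about 0.2 h, and the performance ratio about 80%. The same function can be applied to measured and predicted energies to obtain both sets of indicators.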
The "hold on plot" button lets users retain the current content of the UI Axes, so that two or more graphs can be plotted together for comparison.
Whether it involves predicting PDC or IDC, analyzing VDC, or calculating losses, this MATLAB application allows users to integrate machine learning models seamlessly. This facilitates informed decision-making and optimizes performance in power system management.

Results
This section presents the results obtained along with the datasets used, showcasing the prediction outcomes of the PV system generation under various weather conditions. Furthermore, the results obtained from the MATLAB application are depicted in separate visualizations.
These results represent the performance metrics for different regression models across the PDC dataset. Figure 8 compares the measured and predicted PDC plots using Random Forest, MLP, and k-Nearest Neighbors, and Figure 9 compares the measured and predicted PDC plots using RF.
Figure 10 and Table 4 show the comparative analysis of machine learning algorithms for predicting PV power outputs. As seen in the figure, Random Forest emerges as the top-performing model across all metrics, with an RMSE of 21.02 kW, an NRMSE of 0.048%, an MAE of 7.40 kW, an R-squared (R²) of 0.968, an IAE of 7.40 kW, and an SDD of 21.01 kW, indicating its superior predictive accuracy and robustness. Conversely, the Polynomial regressor exhibits higher errors across the board, with an RMSE of 26.5718 kW and an R² value of 0.9347. While SVR demonstrates competitive performance, it falls behind Random Forest with an RMSE of 27.1202 kW and an R² value of 0.9319. MLP and k-Nearest Neighbors perform moderately well, but they are surpassed by Gradient Boosting, which achieves an RMSE of 23.1536 kW and an R² value of 0.9504. Linear regression, Ridge regressor, and Lasso regressor display higher errors and lower R² values than the top-performing models. Finally, the XGBoost regressor delivers strong results, with an RMSE of 24.0614 kW and an R² value of 0.9464. These findings suggest that Random Forest or XGBoost regressor models would be most beneficial for accurate predictions of PV power outputs, offering superior predictive capabilities over alternative algorithms.
The same methodology used for the PDC dataset is applied to IDC and VDC. Additionally, for the VDC dataset, a filter is included to exclude all values recorded while the PV generator was not working, which improves training performance. Table 5 shows the results obtained for the PDC, IDC, and VDC datasets, presenting various metrics and relevant parameters. The results presented in Figures 11 and 12 pertain to the Random Forest, MLP, and k-Nearest Neighbors predictions across the test datasets for IDC and VDC, respectively.
The results obtained from the MATLAB application use the trained Random Forest regressor models and demonstrate significant success in predicting PV system generation under various weather conditions.
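For readers who want to reproduce error figures of the kind reported in this section, the sketch below computes RMSE, MAE, and R² from measured and predicted values using their standard definitions. The SDD is computed here as the population standard deviation of the prediction errors, which is an assumption about the definition used in the paper. NRMSE and IAE are omitted because their normalization is not restated in this section. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, R^2 and SDD for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    # Root-mean-squared error and mean absolute error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    # Standard deviation of the difference (assumed: population std of errors).
    mean_error = sum(errors) / n
    sdd = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)

    return {"rmse": rmse, "mae": mae, "r2": r2, "sdd": sdd}


# Illustrative values in kW, not taken from the plant dataset.
measured = [100.0, 200.0, 300.0, 400.0]
estimated = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(measured, estimated)
print(metrics)
```

In this toy case every prediction is off by 10 kW with alternating sign, so RMSE, MAE, and SDD all come out at about 10 kW and R² at about 0.992. When the mean error is close to zero, RMSE and SDD nearly coincide, which matches the near-identical RMSE and SDD values reported for Random Forest.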

Discussion
The analysis carried out showcases the effectiveness of the Random Forest regressor model in capturing the intricate relationships between environmental variables and PV system performance. For the PDC dataset, the Random Forest model achieved compelling performance with optimal parameters {'max_depth':
When considering the utilization of machine learning for PV power estimation, comparing results with those of previous studies in the literature is valuable. For instance, in [49], a season-customized artificial neural network (ANN) was proposed to forecast the PV power of a system in Italy, achieving an average Mean Absolute Error (MAE) of 17 W. Similarly, the work in [50] reported average MAE values of 33.63 W for Support Vector regression (SVR) and 50.69 W for ANN estimation of PV power output in Malaysia.
Furthermore, various methodologies have been employed to identify relevant features crucial for accurate prediction. One approach integrated correlation heatmaps with Bayesian optimization techniques, yielding an R-squared of 0.8917 when utilizing Long Short-Term Memory (LSTM) models with 41 diverse features [51]. Another study utilized wavelet transformation-based decomposition techniques with various regression models, including WT-LSTM, LSTM, Ridge regression, Lasso regression, and elastic-net regression, achieving a high R-squared of 0.9505 [52]. Moreover, tree-based feature importance and principal component analysis were employed in a separate study with ANN and Random Forest models [53]. This research emphasized the significance of temperature, humidity, day, and time in predicting PV output, resulting in an R-squared value of 0.9355. Additionally, traditional regression models such as Linear regression, SVR, K-Nearest Neighbors regression, Decision Tree regression, Random Forest regression, Multi-layer Perceptron, and Gradient Boosting regression were assessed using Pearson's correlation and heatmap analyses, considering factors like hour, power, irradiance, wind speed, ambient temperature, and panel temperature. Among these models, Random Forest regression demonstrated the highest R-squared of 0.96, highlighting its effectiveness in predicting PV output power. These findings underscore the importance of feature selection methodologies in capturing pertinent features crucial for accurate prediction and analysis in PV systems [54].
On the other hand, Figures 15-20 illustrate the predictive outcomes for two distinct days: one characterized by cloudy weather and the other by clear skies. One of the notable advantages of the approach lies in the selection of relevant features through Pearson and Spearman correlation analyses. Computing both coefficients between each environmental variable and the target PV system generation captures linear as well as monotonic relationships, which enhances model interpretability and performance. Additionally, the integration of Isolation Forest for outlier detection enables robust data preprocessing, effectively filtering out anomalies and improving model generalization. Moreover, implementing RandomizedSearchCV facilitates efficient hyperparameter tuning, with Random Forest emerging as the best-performing model. Random Forest's ensemble nature and ability to handle nonlinear relationships make it particularly adept at capturing the complex dynamics of PV system generation. Its versatility, scalability, and resilience to overfitting further underscore its suitability for this application. By integrating the trained Random Forest model into the MATLAB application, users can manage PV systems efficiently, using accurate predictions to optimize resource allocation and decision-making.
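The preprocessing steps discussed above (keeping features whose Pearson or Spearman correlation with the target exceeds a threshold, merging both sets, then filtering outliers with Isolation Forest) might look roughly like the following sketch. The column names, the threshold of 0.5, the contamination rate of 0.05, and the synthetic data are assumptions for illustration; the values used in the study are not restated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def select_features(df, target, threshold=0.5):
    """Union of features whose |Pearson| or |Spearman| correlation
    with the target reaches the threshold (threshold is assumed)."""
    pearson = df.corr(method="pearson")[target].drop(target).abs()
    spearman = df.corr(method="spearman")[target].drop(target).abs()
    keep_pearson = set(pearson[pearson >= threshold].index)
    keep_spearman = set(spearman[spearman >= threshold].index)
    return sorted(keep_pearson | keep_spearman)


def remove_outliers(df, columns, contamination=0.05, random_state=0):
    """Drop rows that Isolation Forest labels as anomalies (-1)."""
    detector = IsolationForest(
        contamination=contamination, random_state=random_state
    )
    labels = detector.fit_predict(df[columns])
    return df[labels == 1]


# Synthetic stand-in for the plant measurements (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
irradiance = rng.uniform(0.0, 1000.0, n)
data = pd.DataFrame({
    "irradiance": irradiance,
    "module_temp": 20.0 + 0.03 * irradiance + rng.normal(0.0, 2.0, n),
    "wind_direction": rng.uniform(0.0, 360.0, n),  # unrelated to the target
    "P_DC": 0.2 * irradiance + rng.normal(0.0, 5.0, n),
})

selected = select_features(data, target="P_DC", threshold=0.5)
clean = remove_outliers(data, selected + ["P_DC"])

print(selected)
print(len(data), len(clean))
```

Taking the union rather than the intersection keeps a feature that is strongly but nonlinearly (monotonically) related to the target even when its Pearson coefficient alone would fall below the threshold.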

Conclusions
This study has demonstrated the efficiency of employing machine learning techniques for accurately predicting PV power generation. Through meticulous data preprocessing, feature selection, and model evaluation, Random Forest is identified as the top-performing model for estimating power output from PV plants located in Algeria. Leveraging historical data and computational methods, our approach not only achieves impressive performance metrics, such as a low RMSE of 19.413 and a high R-squared value of 0.968, but also offers valuable insights into the significance of feature selection and outlier detection in enhancing prediction accuracy.
Furthermore, in addition to the evaluation of the models, integrating the best-performing model into a MATLAB application for real-time predictions is proposed. This step not only enhances the usability and accessibility of predictive modeling in renewable energy but also lays the groundwork for practical implementation in addressing energy demands and promoting sustainability.
One potential direction to further enhance prediction accuracy and robustness is the exploration of deep learning and hybrid techniques. Additionally, incorporating more weather data, such as cloud cover, could improve the predictive capabilities of the models, especially in regions with variable weather patterns like Algeria. Furthermore, extending this research to integrate energy storage systems, such as batteries, into the predictive models could facilitate better management of intermittent renewable energy sources like solar power. By forecasting both PV power generation and energy storage levels, operators can optimize energy dispatch strategies and improve grid stability.
As we move towards a future increasingly reliant on clean energy solutions, integrating advanced computational methods holds immense promise in revolutionizing the renewable energy sector.

Figure 1. The heatmap of the outcomes of this correlation analysis.

Figure 2. The feature selection is based on the correlation threshold for PDC.

Figure 3. Distribution of PDC before and after removing outliers.

Figure 3 allows us to compare the distribution of the target variable PDC before and after removing outliers, providing insights into the impact of outlier removal on the data distribution. By identifying and removing outliers, Isolation Forest isolates anomalous data points that may skew the distribution of the variables. Removing the data points classified as outliers helps ensure that the resulting histograms accurately represent the distribution of normal data points within each bin. This enables a clearer understanding of how the data are distributed and how removing outliers affects the overall data distribution [31]. The methodology employed in this study began with thorough data preprocessing steps to ensure the integrity of the dataset. Missing values were addressed through either imputation or removal, and relevant features highly correlated with the target variable were retained. Correlated features were identified using Pearson and Spearman correlation coefficients, and then both sets of correlated features were merged. Additionally, outlier detection and removal were performed using Isolation Forest to enhance the robustness of the models. Subsequently, the data were split into training and testing sets for model evaluation. Following data preprocessing, ten regression models were selected for evaluation.

Figure 5. Process of the 5-fold cross-validation technique used.

Figure 6. The main page of the MATLAB application.

Figure 7. Performance ratio (a) and loss tab (b) of the designed MATLAB application.

Figure 8. Random Forest, MLP, and k-Nearest Neighbors predictions across the test datasets for PDC.


Figure 9. Actual and predicted plots using RF for PDC.

Figure 11. Random Forest, MLP, and k-Nearest Neighbors predictions across the test datasets for IDC.

Figure 12. Random Forest, MLP, and k-Nearest Neighbors predictions across the test datasets for VDC.

Figure 13 compares the actual and predicted values for VDC using RF. The same information is presented in Figure 14 for IDC.

Figure 13. Actual and predicted plots using RF for VDC.

Figure 14. Actual and predicted plots using RF for IDC.

The results obtained from the MATLAB app, using the trained Random Forest regressor models for prediction under different weather conditions for PDC, IDC, and VDC, respectively, are depicted in Figures 15-17 for a clear day and in Figures 18-20 for a cloudy day.

Figure 15. PDC prediction results obtained from the MATLAB application for clear days.

Figure 16. IDC prediction results obtained from the MATLAB application for clear days.

Figure 17. VDC prediction results obtained from the MATLAB application for clear days.

Figure 18. PDC prediction results obtained from the MATLAB application for cloudy days.

Figure 19. IDC prediction results obtained from the MATLAB application for cloudy days.

Figure 20. VDC prediction results obtained from the MATLAB application for cloudy days.

° 51′′ N and 04° 11′′ E, with an elevation of 910 m above sea level. This PV plant is integrated into Ain El-Melh's medium-voltage network and is part of a substantial 400 MW project overseen by the company SKTM, a subsidiary of Sonelgaz. Sonelgaz, mandated by the Algerian government to develop renewable energy, has implemented 23 PV power plants across the highlands and central regions. The Ain El-Melh plant has a total capacity of 20 MWp and spans 40 hectares. The key design specifications of this 20 MWp PV facility are detailed in Table

Table 2. PV plant monitored data.

Table 4. Comparative analysis of machine learning algorithms for predicting PV power outputs.

Table 5. Optimization results and performance evaluation of machine learning models for power distribution predictions.

20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}, resulting in a best RMSE of 21.02 kW, NRMSE of 0.048%, MAE of 7.40 kW, IAE of 7.4076 kW, R-squared (R²) of 0.968, and SDD of 21.01 kW. Similarly, for IDC prediction, the Random Forest model with the same parameters yielded promising results, with a best RMSE of 24.499 kW, NRMSE of 0.0476%, MAE of 8.089 kW, R² of 0.957, IAE of 8.076 kW, and SDD of 24.28 kW, capturing the behavior of IDC despite fluctuations influenced by various factors. For the VDC dataset, the Random Forest model optimized with {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 150} performed best, achieving an RMSE of 11.691 kW, NRMSE of 0.060%, MAE of 7.424 kW, R² of 0.953, IAE of 7.417 kW, and SDD of 11.685 kW.

The analysis revealed that the regression models varied in performance. The Random Forest model achieved the lowest RMSE, NRMSE, MAE, IAE, and SDD together with a high R² score. SVR and MLP showed moderate performance and may benefit from further optimization or feature engineering. MLP offered flexibility through different activation functions and hidden layer sizes but required careful tuning to avoid overfitting. Linear regression, Ridge regressor, and Lasso regressor were less competitive. In contrast, k-Nearest Neighbors, Gradient Boosting, Polynomial regressor, and XGBoost regressor demonstrated moderate to strong predictive capability.
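The hyperparameter sets reported above are the kind of result an exhaustive grid search produces. The sketch below shows the idea in plain Python, under clearly stated assumptions: the candidate values in `param_grid` are hypothetical (only the winning combinations are reported in the text), and `toy_rmse` is a stand-in scoring function, not a real model evaluation. In practice the scoring function would train a Random Forest with the given parameters and return a cross-validated RMSE. The parameter names match those of scikit-learn's `RandomForestRegressor`, which suggests, but does not confirm, that a utility such as its `GridSearchCV` was used.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score) for the lowest score_fn value."""
    best_params, best_score = None, float("inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the actual grid is not given in the text.
param_grid = {
    "n_estimators": [100, 150, 200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}


def toy_rmse(params):
    """Stand-in for a cross-validated RMSE; NOT a real model evaluation."""
    return (
        abs(params["n_estimators"] - 200) / 100
        + abs(params["max_depth"] - 20) / 10
        + abs(params["min_samples_split"] - 2)
        + abs(params["min_samples_leaf"] - 2)
    )


best_params, best_score = grid_search(param_grid, toy_rmse)
n_candidates = sum(1 for _ in expand_grid(param_grid))  # 3 * 3 * 2 * 2 = 36
```

The number of candidates grows multiplicatively with each added hyperparameter, so with k-fold cross-validation the total number of model fits is k times `n_candidates`.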