Comparison of Functionality and Evaluation of Results in Different Prediction Models †

Abstract: This article represents a further step in the continuously developing process of improving the prediction capabilities using databases. Its aim is to compare and evaluate the operation, performance and validity of knowledge extraction techniques related to prediction. The innovative part of this study concerns the selection, enrichment and processing of the database used. In particular, the database contains consumption data for an entire city over the course of a year. These data were then enriched with elements concerning the determination of the time and the environmental conditions, in order to take into consideration the correlation of the data with these parameters. Subsequently, after being converted into an editable format, they were processed using techniques such as normalization and factor analysis, which finally led to the prediction process. At this stage, different methods, such as decision trees, deep learning and generalized linear models, were applied and thoroughly analyzed, and both their operation and their effectiveness were compared and evaluated. The present effort, therefore, intends to provide a useful tool that will contribute to future efforts to improve predictions from existing data.


Introduction
The implementation of databases is, nowadays, an integral part of information technology. Their growing importance stems from the fact that they serve not only to archive data but also to extract knowledge from them, in the form of understandable correlations [1,2].
Moreover, technological developments have made it possible to derive predictions from the analysis of recorded data [3]. As a result, effective data mining techniques are now essential, given the complexity of data patterns and the growing significance of precise forecasts. These techniques take different approaches to predicting the future as accurately as possible, and each of them works most effectively under certain conditions [4,5].
Thus, the present study aims to evaluate these processes by applying different prediction methodologies to a specific type of database. This database consists of numeric data and reflects the electricity consumption in the wider region of the city of Kavala in Greece. In attempting to forecast trends in electricity consumption, it was found that different data mining methods can lead to slightly different numerical prediction results. Therefore, in order to increase the accuracy of the forecasts and the understanding of the factors influencing consumption, the database was enriched with additional data relating to time and environmental conditions. Finally, through experimentation and performance evaluation, the advantages and disadvantages of each method in terms of prediction are highlighted.
The results of this study indicate which of the applied methods are most appropriate for specific types of databases. Furthermore, the knowledge gathered opens the opportunity for the creation of more precise prediction models, which can assist and have a significant impact on decision-making processes.

Data Processing
As mentioned above, the initial data were obtained from the public electricity company of Greece and comprised the consumption of the city of Kavala over two years, 2022 and 2023. These data were the loads, in amperes, of twelve transformers in the wider area of the city, recorded at half-hourly intervals. In order to compare the prediction results, it was decided that the first archive, that of 2022, would be used to implement the forecasting techniques, while the other, that of 2023, would be used to evaluate the results of these methods.
After both files were examined and cleared of missing or incorrect data [6], the 2023 file was left as it was, to be used in the final results evaluation process. The 2022 archive, however, was enriched with additional data that correlated the consumption with the factors that influenced it [7]. Thus, columns relating to the days of the week were added to the half-hourly consumption throughout 2022. In addition, the daily temperatures were recorded as maximum, minimum and average values. The average humidity, the monthly rainfall, the number of rainy days, the intensity of the wind, the barometric pressure, whether the particular day was sunny or not and, finally, the monthly amount of solar energy using a photovoltaic panel were also recorded. The remaining values were obtained from the official website of the Hellenic National Meteorological Service [8]. It should be pointed out that data that were not in numeric form were transformed by substitution. For example, the days of the week were represented in seven corresponding columns: each column took the value 1 only for the half-hours falling on its day and the value 0 otherwise. Finally, both archives were subjected to a normalization process with the aim of placing them on the same scale.
After the completion of the above processes, the two files were ready for the continuation of the procedure.
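The substitution and normalization steps described in this section can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which normalization was applied, so min-max scaling is assumed here, and the function names and load values are invented for the example.

```python
from datetime import datetime, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot_weekday(timestamp):
    """Return seven 0/1 flags; only the flag of the matching weekday is 1."""
    flags = [0] * 7
    flags[timestamp.weekday()] = 1  # Monday is index 0
    return flags


def min_max_normalize(values):
    """Rescale a column to [0, 1]; a constant column maps to all zeros.

    Min-max scaling is an assumption; the paper only says 'normalization'.
    """
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Four consecutive half-hourly readings with invented load values (amperes).
start = datetime(2022, 1, 1, 23, 0)  # 1 January 2022 was a Saturday
timestamps = [start + timedelta(minutes=30 * i) for i in range(4)]
loads = [410.0, 395.0, 380.0, 440.0]

weekday_columns = [one_hot_weekday(t) for t in timestamps]
normalized_loads = min_max_normalize(loads)

for t, flags, load in zip(timestamps, weekday_columns, normalized_loads):
    print(t.isoformat(), dict(zip(DAY_NAMES, flags)), round(load, 3))
```

The first two readings fall on a Saturday and the last two on a Sunday, so the flag moves from the sixth to the seventh column at midnight.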

Factor Analysis
As described previously, two archives were created. The 2022 archive, which was enriched, contained, in its final form, 32 columns and 17,520 rows. The columns corresponded to variables such as consumption, temperatures, etc., and the rows to the temporal subdivision of the whole year into half hours. This led to a total of 543,151 records. Therefore, in order to reduce the amount of data to be examined, and to enable the methods to be applied subsequently, it was decided to use factor analysis. Using this method, all consumption data, contained in 31 columns, were replaced with a certain number of factors. To determine this number, the Kaiser criterion and a scree plot were used. The result is shown in Figure 1, which indicates that the number of factors is 3.
Additionally, it should be mentioned that, for factor rotation, Varimax raw was used, with principal components for the extraction [9,10].
Following the completion of the above methodology, the set of data in the initial file was replaced by three single factors, resulting in a significant reduction in the amount of data in the original file. Only the column containing the total consumption was preserved, and the following prediction procedures are applied to it.
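The Kaiser criterion mentioned in this section retains the factors whose eigenvalue exceeds 1, i.e., those that carry more variance than a single standardized variable. A minimal sketch is given below; the eigenvalues are invented for a six-variable example and are not those of the study, and the function names are hypothetical.

```python
def kaiser_factor_count(eigenvalues):
    """Count the eigenvalues greater than 1 (the Kaiser criterion)."""
    return sum(1 for value in eigenvalues if value > 1.0)


def explained_share(eigenvalues, n_factors):
    """Share of total variance carried by the n_factors largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_factors]) / sum(ordered)


# Invented eigenvalues of a 6-variable correlation matrix (they sum to 6).
eigenvalues = [2.9, 1.4, 1.1, 0.3, 0.2, 0.1]

n_factors = kaiser_factor_count(eigenvalues)
print("factors retained:", n_factors)
print("variance explained:", round(explained_share(eigenvalues, n_factors), 3))
```

In a scree plot, the same eigenvalues would be drawn in descending order, and the retained factors are those to the left of the point where the curve flattens.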

Prediction Process
With the processing of the files complete, the three chosen forecasting methodologies, i.e., decision trees, deep learning and a generalized linear model, are applied in sequence in this section. It is worth noting that five methodologies were initially considered: random forest and gradient boosted trees, in addition to the above-mentioned ones. However, in the end, to limit the duration of this task, only the three with the best forecasting performance were chosen to be analyzed in detail.
The RapidMiner software was used for the implementation of the procedure. Figure 2 depicts the column selected for the final prediction, which is the sum of the electrical consumption, by half-hour intervals, for the year 2022. The consumption column was then compared with the three factors created in the previous process. As a result, a general graph with the performance of all methods used for the prediction was produced. The performance was measured by the runtime (in ms) and the relative error of each method individually. This includes the model's prediction accuracy and other performance criteria, depending on the type of classification problem. The performance was calculated on a 40% hold-out set, which had not been used for any of the performed model optimizations. This hold-out set was then used as input for a multi-hold-out-set validation, in which the performance was calculated for 7 disjoint subsets. The highest and the lowest performance were removed and the average of the remaining five is reported here. Although this type of validation is not as thorough as full cross-validation, this approach strikes a good balance between the runtime and model validation quality. Some examples are illustrated in Figure 3.
For the reasons mentioned above, in this study, only the generalized linear model, deep learning and decision tree methods were selected for the comparison, because they had relatively close performance metrics. The analysis of the results of these methods follows below.
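The multi-hold-out-set validation described above can be sketched as follows. This is an approximation written for illustration, not the RapidMiner implementation: the function names, the way rows are dealt into subsets, the random seed and the toy data are all assumptions.

```python
import random


def relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual| over paired values."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def multi_hold_out_score(rows, predict, holdout_fraction=0.4, n_subsets=7, seed=0):
    """Score `predict` on disjoint subsets of a hold-out set, trimming both extremes.

    rows    : list of (features, target) pairs
    predict : function mapping features to a predicted target
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]  # the remaining rows would be used for training

    # Deal the hold-out rows into n_subsets disjoint groups.
    subsets = [holdout[i::n_subsets] for i in range(n_subsets)]
    scores = []
    for subset in subsets:
        actual = [target for _, target in subset]
        predicted = [predict(features) for features, _ in subset]
        scores.append(relative_error(actual, predicted))

    # Drop the lowest and the highest score, then average the remaining ones.
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)


# Toy data: target = 600 + 10 * x, scored against a deliberately biased model.
rows = [((x,), 600.0 + 10.0 * x) for x in range(70)]
score = multi_hold_out_score(rows, predict=lambda f: 600.0 + 10.0 * f[0] + 5.0)
print(round(score, 4))
```

Dropping the two extreme subsets makes the reported average less sensitive to a single unusually easy or unusually hard subset, at a much lower cost than refitting the model for every fold as full cross-validation would.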

Results
In this section, the predictions of the methods chosen will be described.These results will then be compared with the actual consumption that occurred in the following year, 2023.At this point, therefore, the second archive will be used, i.e., the 2023 consumption data, in order to evaluate the results.

General Linear Model
The first method analyzed is the general linear model method. The numerical model used in the method is presented in Table 1. The relative error of this method is calculated to be roughly 7% and the runtime is close to 0 due to the simplicity of the model. Furthermore, the graph of the generalized linear method with all of the prediction values of the consumption is depicted in Figure 4.
The main result of this method is that the average electrical consumption predicted for the year 2023 is 627.8678. In comparison with the actual value of 629, which was derived from the actual electricity consumption file of 2023, the percentage error was only 0.18%.
The chart of the final prediction of this approach, compared with the actual data and the initial data, can be seen in Figure 5.
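The percentage error quoted in this section can be reproduced from the two averages given in the text. The sketch below assumes that the error is taken relative to the actual value; the source does not state the formula, and the function name is hypothetical.

```python
def percentage_error(predicted, actual):
    """Absolute difference between prediction and actual value, as a percentage of the actual value."""
    return abs(predicted - actual) / abs(actual) * 100.0


glm_mean_prediction = 627.8678  # predicted 2023 average, from the text
actual_mean_2023 = 629.0        # actual 2023 average, from the text

error = percentage_error(glm_mean_prediction, actual_mean_2023)
print(f"{error:.2f}%")  # prints 0.18%
```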

Decision Tree
The second prediction methodology examined is the decision tree method. This non-parametric algorithm can efficiently deal with large, complex data sets. Furthermore, this methodology is widely used both in data mining, to create classification systems, and in the development of prediction algorithms for a target variable, as in our case.
The decision tree classifies data into branch-like blocks and creates an inverted tree-like structure, part of which is shown in Figure 6. Moreover, the chart with all of the prediction values of the decision tree method can be seen in Figure 7.
By comparing, in a similar way, the results of this approach with the actual data available, it is concluded that this method is not so effective in this particular forecasting process.
In particular, 628.4339 is the average value predicted with the decision tree method, which differs from the actual target value of 629 by 0.5661. The results are illustrated in Figure 8.
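The branch-like partitioning described in this section can be illustrated with a single split of a regression tree. The sketch below is not the RapidMiner algorithm: it assumes one predictor, a squared-error criterion and invented data, and the function names are hypothetical. A full tree would apply the same search recursively to each resulting block.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing the total squared error.

    Candidate thresholds are midpoints between consecutive distinct x values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]


# Invented data: a factor score (x) against consumption (y).
factor_scores = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
consumption = [580.0, 590.0, 585.0, 660.0, 670.0, 665.0]

threshold, left_mean, right_mean = best_split(factor_scores, consumption)
print(threshold, left_mean, right_mean)
```

Each leaf predicts the mean of the target values that fall into its block, which is why the predictions of a tree form a step-like curve rather than a smooth one.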

Deep Learning
Finally, the last method to be examined is the deep learning method. The model is displayed in Table 2. Moreover, the chart with all of the prediction values for the deep learning method is depicted in Figure 9. While the previously investigated methods predicted a decrease in consumption with varying degrees of accuracy, this method predicts an increase to 637.0455.

Conclusions and Proposals
After the completion of the above procedures and the analysis of the results, it is evident that data mining processes are very important in creating valid prediction models. Applying different data mining techniques makes it possible to predict, to compare how the techniques function and to identify the relative error in each case. In this particular implementation, all of the examined methods had satisfactory results regarding the correctness of the prediction process. With different degrees of accuracy, the general linear model and the decision tree predicted a reduction in electricity consumption for the year 2023, whereas the deep learning method predicted an increase. It should be noted that the general linear model method had the highest accuracy, with only a 0.18% percentage error, while the deep learning method had the lowest, with a 1.26% error.
Generally, all of the forecasts were very accurate, which can be explained by the way in which the database was constructed and processed. The positive contribution of enriching the database with additional relevant data is evident, as is that of using the statistical method of factor analysis to reduce the size of the database. Overall, knowledge extraction from large data sets is vital in every aspect of science, technology and economics, because it makes it possible to prepare for different potential outcomes, depending on the conditions and parameters taken into account each time patterns in the data are analyzed.
As a complement to the present study, the same prediction methods could be applied to databases of different types and sizes. It would also be possible to repeat the whole procedure without using statistical methods such as factor analysis and normalization, in order to establish the ways in which these methodologies contribute to the forecasting processes.

Figure 3. General graph of the performance of each method.

Figure 4. Graph of the generalized linear method with the prediction values.

Figure 6. Part of the decision tree model.

Figure 7. Graph of the decision tree method with the prediction values.

Figure 9. Graph of the deep learning method with the prediction values.

Table 1. The numerical model of the general linear model.