4.1. Prediction Model Implementation
This study collected 779,275 raw data records over 365 days (12 months). Data from 11 months were used for training, and the models then predicted daily energy consumption for one randomly selected month. This section analyzes the impact of adding the BPT label to the input variables of five data-driven models: BPNN, SVR, RF, LASSO, and KNN. The balance point temperature (BPT) information was added to the input variables as a categorical label, with a value of 0 when the outdoor average temperature was below 22.2 °C and 1 when it was above.
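The labeling rule can be sketched as follows. This is a minimal illustration, assuming the daily mean outdoor temperature is available as a list of floats; the function name `bpt_label` and the sample temperatures are hypothetical, and a day exactly at 22.2 °C is mapped to 0 here only because the text does not specify that case.

```python
BPT_THRESHOLD_C = 22.2  # balance point temperature stated in the text

def bpt_label(mean_outdoor_temp_c, threshold=BPT_THRESHOLD_C):
    """Return 1 when the daily mean outdoor temperature is above the
    balance point temperature, else 0 (assumption: equality maps to 0)."""
    return 1 if mean_outdoor_temp_c > threshold else 0

# Hypothetical daily mean temperatures in degrees C, for illustration only.
daily_mean_temps = [15.0, 22.2, 22.3, 30.1]
labels = [bpt_label(t) for t in daily_mean_temps]
print(labels)
```

The resulting 0/1 column would simply be appended to the existing input variables before training.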
Before conducting predictive analysis, the hyper-parameters of these models were optimized using grid search and 5-fold cross-validation, and the final optimal parameters are shown in Table 3. The visualization of the grid search is presented in Figure 7.
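The combination of grid search and 5-fold cross-validation can be sketched in plain Python. The text does not say which library or scoring function was used, so everything here is an assumption made for illustration: a simple unweighted KNN regressor stands in for the five models, R2 is used as the score, the data are synthetic, and the helper names (`k_fold_splits`, `grid_search`) are hypothetical.

```python
import itertools
import random

def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def knn_predict(train_x, train_y, query_x, k):
    """Unweighted k-nearest-neighbour regression on a single feature."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - query_x))[:k]
    return sum(train_y[i] for i in nearest) / k

def k_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def grid_search(x, y, param_grid, n_folds=5):
    """Return (best_params, best_mean_score) over every grid combination."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, val_idx in k_fold_splits(len(x), n_folds):
            tx = [x[i] for i in train_idx]
            ty = [y[i] for i in train_idx]
            preds = [knn_predict(tx, ty, x[i], **params) for i in val_idx]
            scores.append(r2_score([y[i] for i in val_idx], preds))
        mean_score = sum(scores) / n_folds  # cross-validation score
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Synthetic data (not from the study): a noisy linear trend.
rng = random.Random(1)
x = [float(i) for i in range(60)]
y = [2.0 * xi + rng.gauss(0.0, 3.0) for xi in x]
best_params, best_score = grid_search(x, y, {"k": [1, 3, 5, 9]})
print(best_params, round(best_score, 4))
```

Each grid point is scored by the mean of its five validation-fold scores, which is the kind of quantity plotted as the cross-validation score in a grid-search visualization.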
Figure 7a shows part of the optimization process of the BPNN. As mentioned in Section 3.2, four BPNN hyper-parameters need to be optimized. To keep the visualization readable, however, this study displays only the changes in the number of hidden layers and hidden nodes, with the activation function and learning rate fixed at their optimized values of ReLU and 0.01, respectively. With the activation function and learning rate held constant, the number of hidden layers and hidden nodes has only a limited influence on the performance of the BPNN, and the model performance fluctuates only slightly across the search. The optimal number of hidden layers is 3 and the optimal number of hidden nodes is 50, with a cross-validation score of 0.8484.
Figure 7b shows the optimization process of the RF. The max depth has a greater impact than the number of trees: when the max depth is fixed, changing the number of trees has only a slight effect on the predictive performance of the RF. The model performs best when the max depth is 5 and the number of trees is 20.
Figure 7c shows the optimization process of the SVR. Again, to keep the visualization readable, only the influence of C and gamma is shown, with the kernel function fixed as RBF. Once C and gamma increase beyond a certain point, the cross-validation score drops sharply. The optimal values of C and gamma are 0.8 and 0.23, respectively.
Figure 7d shows the optimization process of LASSO. As alpha increases, the performance of the model generally decreases; in particular, when alpha is greater than 1, the model score is almost 0. The optimal alpha for the model is 10^−3.
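One way to see why a large alpha can drive the LASSO score toward 0 is the closed-form solution for a single centered feature. The sketch below assumes the objective (1/(2n)) Σ (y_i − b·x_i)² + alpha·|b|, which is one common convention and is not stated in the text; the data and the name `lasso_1d` are hypothetical. Once alpha exceeds the feature-target covariance term, the coefficient is exactly zero, the model predicts the mean, and R2 becomes 0.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_1d(x, y, alpha):
    """Single-feature LASSO coefficient for centered x and y under the
    assumed objective (1/(2n)) * sum((y - b*x)^2) + alpha * |b|."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n    # mean of x*y
    scale = sum(xi * xi for xi in x) / n              # mean of x^2
    return soft_threshold(rho, alpha) / scale

# Hypothetical centered data with y = 2x exactly.
x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [-2.0, -1.0, 0.0, 1.0, 2.0]
for alpha in (0.001, 0.5, 1.0, 2.0):
    print(alpha, lasso_1d(x, y, alpha))
```

The alpha value at which the coefficients vanish depends on how the inputs are scaled, so the threshold of 1 reported above is specific to the study's data.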
Figure 7e shows the optimization process of KNN. The optimal parameter combination for the model is K = 3, p = 3, and weights = distance.
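To make these three settings concrete, the sketch below implements a distance-weighted KNN regressor in which p is read as the exponent of the Minkowski distance and "distance" weighting as inverse-distance weights. That reading follows common library conventions and is an assumption here; the function names and the toy data are hypothetical.

```python
def minkowski(a, b, p=3):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def knn_regress(train_X, train_y, query, k=3, p=3):
    """Predict with the k nearest neighbours, weighted by 1/distance."""
    nearest = sorted(
        (minkowski(row, query, p), target)
        for row, target in zip(train_X, train_y)
    )[:k]
    # A zero-distance neighbour takes all the weight (avoids division by zero).
    exact = [target for dist, target in nearest if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / dist for dist, _ in nearest]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Toy data, for illustration only.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 10.0, 20.0, 100.0]
print(knn_regress(train_X, train_y, [0.5], k=3, p=3))
print(knn_regress(train_X, train_y, [1.0], k=3, p=3))
```

With inverse-distance weights, a query that coincides with a training sample returns that sample's target, so predictions on the training set itself reproduce the targets. This may help explain a near-perfect training fit for this kind of model.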
The evaluation metrics of each model before and after introducing the BPT label are shown in Table 4. The only difference between “new” and “original” is that the input variables of the new model include the BPT label. The predictive performance of every data-driven model improves markedly with the introduction of the BPT label, and all evaluation metrics are better on both the training set and the test set. The BPNN model shows the largest improvement, with an increase of 0.3448 in R2 and a decrease of 19.20% in CV-RMSE on the test set. The KNN model shows the smallest improvement, but its R2 value still increases by 0.144, which is also a substantial gain.
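For reference, the three metrics discussed in this section (R2, CV-RMSE, and NMBE) can be computed as in the sketch below. It assumes commonly used definitions with n in the denominator and residuals taken as actual minus predicted; some guidelines use n − p in the denominator or the opposite sign for NMBE, and the definitions given earlier in the paper take precedence. The sample values are hypothetical.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cv_rmse(y_true, y_pred):
    """Coefficient of variation of the RMSE, in percent of the mean."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / mean_y

def nmbe(y_true, y_pred):
    """Normalized mean bias error, in percent (assumed sign: actual - predicted)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    return 100.0 * sum(t - p for t, p in zip(y_true, y_pred)) / (n * mean_y)

# Hypothetical daily energy values, for illustration only.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 150.0, 155.0]
print(r2(actual, predicted), cv_rmse(actual, predicted), nmbe(actual, predicted))
```

A higher R2 together with a lower CV-RMSE and an NMBE closer to 0 indicates a better fit, which is the sense in which the metrics are compared across models here.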
The evaluation metrics show that the LASSO model has the lowest predictive accuracy. Before the BPT label is incorporated, its R2 value is only 0.2167; even after the BPT label is added to the input variables, its R2 value increases only to 0.5232. The reason is that LASSO is essentially a linear regression model: although it incorporates a regularization term, its predictive accuracy still lags behind that of the more complex algorithms.
On the training set, the KNN model fits best: its training-set metrics are better than those of the other four models whether or not the balance point temperature label is added. Its R2 value almost reaches 1.0 and its NMBE value is close to 0, so it fits the training data almost perfectly. However, this points to a risk of overfitting: with the BPT label added, its test-set R2 value is lower than those of BPNN and SVR, and its NMBE value even increases slightly. This indicates that overfitting may reduce the prediction accuracy of KNN. On the other hand, the same characteristic can give KNN an advantage over the other models when the input variable data are insufficient.
The BPNN model is the most sensitive to the input variables: the prediction accuracy of its new model is 70% higher than that of the original model. The prediction performance of BPNN therefore depends heavily on its input variables, and appropriate input variables must be chosen when this model is used.
In terms of the evaluation metrics, the SVR and RF achieve similar prediction accuracy. Both are also strongly affected by the input variables: with the addition of the BPT label, their prediction accuracy increases by about 45%.
Overall, for data-driven algorithms trained on the same dataset, the predictive performance ranks from best to worst as KNN, RF, SVR, BPNN, and LASSO when the input variables are insufficient, and as BPNN, SVR, KNN, RF, and LASSO when the input variables are sufficient.
To evaluate whether the improvement in prediction accuracy from adding the BPT label is statistically significant, a t-test was performed on the predicted energy values. The null hypothesis was that adding the BPT label makes no significant difference to the model results. The test results, shown in Table 5, indicate that all p-values are less than 0.01, so the null hypothesis is rejected. This demonstrates that adding the balance point temperature label to the input variables can significantly improve the predictive performance of the data-driven models.
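The text does not state which variant of the t-test was used. As one plausible reading, the sketch below runs a paired two-sided t-test on the daily predictions of an original and a new model for the same test days; the pairing, the function names, and the numbers are assumptions made for illustration. The p-value is obtained by numerically integrating the Student t density, so no external statistics library is needed.

```python
import math

def paired_t_statistic(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1

def t_pdf(x, df):
    """Probability density of the Student t distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return (math.exp(log_c) / math.sqrt(df * math.pi)
            * (1 + x * x / df) ** (-(df + 1) / 2))

def two_sided_p_value(t_stat, df, steps=10000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even).
    Accuracy degrades for extremely large |t|."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)

# Hypothetical daily predictions from the new and original models.
pred_new = [10.0, 12.0, 9.0, 11.0, 13.0]
pred_original = [8.0, 11.0, 8.5, 9.0, 10.0]
t_stat, df = paired_t_statistic(pred_new, pred_original)
p_value = two_sided_p_value(t_stat, df)
print(t_stat, df, p_value)
```

A p-value below the chosen significance level (0.01 in the text) leads to rejecting the null hypothesis that the two sets of predictions do not differ.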
Figure 8 shows the daily building energy consumption predicted by each data-driven model with and without the BPT label, i.e., using the new and original datasets, and compares the predictions with the actual energy consumption. In each figure, “real value” represents the ground truth, “pred” represents the value predicted using the original dataset, and “new pred” represents the value predicted using the new dataset with the BPT label. The data-driven models with the BPT label follow the trend of the data more closely.
4.2. Importance Analysis of Input Variables
This study analyzes the building energy consumption of an apartment building in Xiamen, China. Four explanatory variables were collected and five data-driven models were established to predict the building’s daily energy consumption. A balance point temperature label was later added to determine the impact of its inclusion on the accuracy of the data-driven model predictions. The importance of each input variable, i.e., the degree of influence on energy consumption, was analyzed using the feature importance method based on the RF algorithm.
Figure 9A shows the importance of each variable in the original model. The daily minimum air temperature (85%) plays a dominant role: its impact on energy consumption is far greater than that of the other variables, including the daily maximum temperature (11%). This is an interesting phenomenon. The sunny day index accounts for 3%, while the importance of the holiday index is almost negligible. With the addition of the BPT label, the importance of the minimum temperature decreases, with part of it taken over by the balance point temperature, while the importance of the other variables remains largely unchanged, as shown in Figure 9B. In the new model, the BPT label accounts for 25% of the importance, and the importance of the minimum temperature drops to 62.5%. Therefore, the balance point temperature cannot be ignored in models that predict daily building energy consumption. This also explains why adding the BPT label to the input variables in the previous experiment greatly improves prediction accuracy.
It is worth noting that, in the daily energy consumption prediction model, both the daily minimum and the daily maximum temperature carry temperature information from the meteorological data, yet the daily minimum temperature dominates while the daily maximum temperature matters far less. This suggests that the daily minimum temperature, rather than the daily maximum temperature, shapes people's overall thermal sensation on a given day. For example, if the minimum temperature of a day is very low, people will still regard it as the cold season even if the maximum temperature at noon is high, and they usually will not turn on the air conditioning briefly at noon, so the energy consumption of that day does not increase significantly. Conversely, if the minimum temperature of the day is also high, people will feel that it is the hot season and choose to use air conditioning, which significantly increases energy consumption.