3. Result & Discussion
Figure 3 presents a complete evaluation of six machine learning models (Artificial Neural Network, Random Forest, XGBoost, Radial Basis Function, Autoencoder, and Decision Tree) on predicting a continuous target variable. Each subplot shows a scatter plot of predicted versus actual values, with each point marked by a blue circle and a red regression line indicating the trend. Both axes are scaled from 0 to 30 for predicted and actual values to allow direct visual comparison of the models. This arrangement gives an effective visual summary of each model's predictive accuracy and resulting error distribution.
The regression lines and the distribution of points across the models illustrate different levels of performance. Artificial Neural Network, Random Forest, and XGBoost show relatively tight distributions of points around the regression line, suggesting better predictive performance and fewer outliers. The Radial Basis Function and Autoencoder show greater scatter, particularly at higher magnitudes, which indicates variance in predictive accuracy as well as potential overfitting or underfitting. The Decision Tree model exhibits moderate performance, with clear deviation from the ideal y = x line indicating lower predictive accuracy.
In Figure 4, ensemble methods, such as voting, stacking, and blending, are applied to combine multiple diverse models and improve predictive performance beyond what a single model can provide. Voting ensembles combine predictions by majority vote or averaging, offering a robust and straightforward way to aggregate predictions. The more advanced ensemble techniques, stacking and blending, add a meta-learner: another model trained on the predictions of the base models. Stacking typically uses cross-validated predictions for this training, while blending uses a simple holdout set; in both cases, the meta-learner is trained to intelligently weigh the base models' predictions and obtain a superior combined prediction.
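The voting and stacking strategies described above can be sketched with scikit-learn. This is a minimal illustration on a synthetic dataset; the base models, hyperparameters, and data are placeholders, not those used in the study.

```python
# Voting vs. stacking on a synthetic regression problem (illustrative only).
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, StackingRegressor,
                              VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ("ann", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
]

# Voting: simply averages the base models' predictions.
voter = VotingRegressor(estimators=base).fit(X_train, y_train)

# Stacking: a Ridge meta-learner is trained on cross-validated
# out-of-fold predictions of the base models.
stacker = StackingRegressor(estimators=base, final_estimator=Ridge(), cv=5)
stacker.fit(X_train, y_train)

print(voter.score(X_test, y_test), stacker.score(X_test, y_test))
```

The key difference is visible in the API: `VotingRegressor` has no second-level model, while `StackingRegressor` takes a `final_estimator` that learns how to weigh the base outputs.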
Hybrid ensemble voting is demonstrated in Figure 4 as a case where multiple predictive models cooperate to produce a more accurate outcome. More precisely, the method combines the predictions of different models, letting each impart its strengths in solving the general problem. By using a voting system, where every model's prediction counts toward the outcome, this technique reduces the biases of individual models and increases the reliability of predictions. The figure conveys the impression of various prediction paths meeting in the middle, confirming that disparate model outputs can collaboratively lead to a single, enhanced prediction.
The figure also shows stacking, a process by which multiple base models make predictions that are then combined by a meta-learner. The meta-learner is trained to leverage the signals from the base models, which can be diverse, to create the best combination of their outputs. The figure depicts the relationship between the base models and the meta-learner, showing how the meta-learner uses base predictions to generate a final output. Stacking is a useful method for capturing complex patterns in the data, as well as interdependencies among the base predictions, leading to a more informed decision than any single model could make.
Blending, the third technique depicted, merges the predictions of different models by averaging or weighting them according to their accuracies on a holdout set, rather than training its aggregator on cross-validated predictions as stacking does. This simple mixing of results makes the technique easy to implement and fast. Although blending can produce results that are up to standard, its reliance on a single holdout set for validation is the key trade-off: it makes blending a feasible option for many use cases, but it may forgo some of the intricate performance that stacking can deliver in complicated situations.
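A minimal blending sketch follows. Base models are fit on a training split, and their holdout-set predictions are combined with accuracy-based weights; the inverse-MSE weighting used here is one illustrative choice, not the study's exact scheme.

```python
# Blending via accuracy-weighted averaging on a holdout set (illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=1)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=1)

models = [RandomForestRegressor(n_estimators=50, random_state=1), Ridge()]
for m in models:
    m.fit(X_train, y_train)

# Each model's holdout error determines its blending weight.
hold_preds = [m.predict(X_hold) for m in models]
errors = np.array([mean_squared_error(y_hold, p) for p in hold_preds])
weights = (1.0 / errors) / (1.0 / errors).sum()  # inverse-MSE weights, sum to 1

# Final blended prediction: weighted average of the base predictions.
blended = sum(w * p for w, p in zip(weights, hold_preds))
```

The more accurate a base model is on the holdout set, the larger its share in the blended output, which is the "weighting according to their accuracies" idea in code form.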
The diagram shows the performance of the three ensemble techniques: hybrid ensemble voting, stacking, and blending, each reaching a different level of accuracy. Hybrid ensemble voting attains 85% accuracy by combining the predictions of individual models to minimize bias and variance. Stacking improves on this, reaching 90% as its meta-learner effectively leverages the differences in outputs among the base models. Blending achieves the highest accuracy at 97%, demonstrating the strongest predictive power of the three. These figures highlight the potential of ensemble techniques to enhance predictive performance across different scenarios.
Based on the data provided in Figure 5, which analyzes a traditional building's energy demand, this comparative study evaluates the performance of nine machine learning models (ANN, RF, XGBoost, RBF, Autoencoder, Decision Tree, and the voting, stacking, and blending approaches) in predicting energy consumption metrics. The left panel demonstrates that simpler models such as the Decision Tree and ANN achieved lower RMSE and MAE values (below 2) for heating demand prediction, indicating better accuracy in forecasting overall energy consumption. The right panel reveals that ensemble methods such as voting and stacking excelled on the statistical performance metrics (KGE, NSE, and R2, approaching 0.9) for bimodal demand patterns, suggesting these advanced techniques better capture the complex, dual-mode energy-usage characteristics typically found in traditional buildings with combined heating and operational energy requirements.
The superior performance of the blended ensemble model (97% accuracy and 0.9999 correlation) can be attributed to its intrinsic ability to mitigate the individual weaknesses of the base learners while capitalizing on their strengths. The blending technique, particularly its optimal weight allocation to base-model predictions, effectively dampens variance and corrects the systematic bias that persists in single models, thereby producing a smoother, more generalized prediction surface capable of capturing the most complex, non-linear dependencies in the energy and CO2 data. Conversely, the fundamentally poor performance of the Radial Basis Function (RBF) model (0.2772 correlation) is likely exacerbated by the limited dataset size of only 100 data points. We hypothesize that the failure stems from challenges in optimal kernel width selection, which is critical for defining the neighborhood of influence, or from high sensitivity to feature rescaling, preventing the RBF kernel from accurately mapping the complex, high-dimensional input features to the output space. This is a common issue with small datasets, where the kernel cannot establish robust local relationships, and it highlights the RBF's limitation in modeling the heterogeneous nature of building energy consumption compared with the robust, hierarchical learning of the tree-based and deep neural network models.
The correlation heatmap of the model predictions, shown in Figure 6, reflects that nearly all machine learning models have a very strong, positive correlation with the actual target values. Models such as ANN, Autoencoder, Decision Tree, and the ensemble methods including voting and blending have correlation coefficients above 0.99, indicating that their predictions are nearly on target with the true values. This suggests that these models fit the data well and provided close predictions; their consistently strong performance indicates that they successfully modeled the predictive task at hand. The Radial Basis Function (RBF) model, on the other hand, demonstrates a very weak correlation with both the true values and the other models' predictions. Its correlation with the true values is as low as 0.2772, and its correlations with the other models are all below 0.6, marking RBF as a clear outlier. This low correlation suggests either that RBF is a very poor model or that it is capturing features of the data largely unrelated to the true target variable.
The ensemble methods, voting and stacking, exhibited correlation patterns of interest. Voting maintained very high correlations (for instance, 0.9764 with the actual values), while stacking showed a moderately strong but comparatively lower correlation (0.6852 with the actual values). This discrepancy implies that the stacking algorithm may be integrating model predictions in a manner that introduces some variance, or relying more heavily on the weaker base models, such as RBF, while the voting method likely derives its strength from the strong consensus of the top-performing models.
In addition, the inter-correlations among the top models (ANN, XGB, RF, Autoencoder, and Tree) were so high that they hinted at the possibility of the models making similar errors or capturing redundant data. This redundancy might weaken the advantage of ensemble methods that rely on model diversity. The outstanding performance of the blending ensemble, which is almost perfectly correlated (0.9999) with the actual values, reflects its capability to use the strengths of the individual models to generate accurate predictions, possibly through optimally assigning weights to their contributions.
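A prediction-correlation matrix of this kind is straightforward to assemble. The sketch below uses made-up columns (one strong model, one weak one) purely to show the mechanics; the column names and values are illustrative, not the study's data.

```python
# Building a correlation matrix over model predictions (illustrative data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
actual = rng.normal(size=100)

preds = pd.DataFrame({
    "Actual": actual,
    "ANN": actual + rng.normal(scale=0.05, size=100),  # tracks the target closely
    "RBF": rng.normal(size=100),                       # nearly uncorrelated noise
})

corr = preds.corr()               # Pearson correlations between all columns
print(corr.loc["ANN", "Actual"])  # close to 1 for the strong model
```

High inter-correlations among strong models in such a matrix are exactly the redundancy signal discussed above: models that agree with the target also tend to agree with each other, limiting the diversity an ensemble can exploit.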
Using the Taylor diagram provided in Figure 7, which compares the different machine learning models, several major conclusions about their performance characteristics can be drawn. The diagram maps model performance using three metrics: the radial distance from the origin indicates the spread (standard deviation) of the model predictions, the angular position specifies the correlation with the actual values, and the distance from the reference point (the "Perfect Model") indicates the centered root-mean-square difference. The ideal model would be positioned at the reference point, marked by a red cross at approximately 315 degrees.
The models form distinct, easily separated clusters. Artificial Neural Network (ANN), Random Forest (RF), XGBoost (XGB), and Autoencoder (AutoEnc) cluster in the upper-right quadrant, showing overall strong performance with higher correlation coefficients and only moderate standard deviations. In contrast, the RBF and Tree models are positioned quite differently from the others, most likely indicating different performance characteristics. The voting and stacking ensemble methods, as well as the blended model, show some of the most varied positioning. Intriguingly, the blended model was one of the closest to the reference point; most other algorithms were not as near the perfect-model coordinates.
The Taylor diagram illustrates that the ensemble methods operate distinctly: stacking and blending sit closer to the ideal reference point, demonstrating that these two methods better replicate the observed data's pattern and variability. They achieve this through their meta-learning schemes. Stacking employs a second-level model to optimally weight the base learners' predictions, while blending uses a holdout (validation) set to train the aggregator. In this way, stacking and blending pull together the best elements of the various models more intelligently than simple weighting, considering consensus based on learned accuracy. The voting method is certainly reliable and within a decent tier of ability, but because it combines model outputs by simple averaging or a majority-rule vote, it does not incorporate the nuanced weighting that stacking and blending use, so those methods generally achieve higher correlations and lower errors.
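The three statistics a Taylor diagram encodes can be computed directly, and they are tied together by a law-of-cosines identity that is what makes the diagram geometrically consistent. The series below are made up for illustration.

```python
# The three Taylor-diagram statistics and the identity linking them.
import numpy as np

def taylor_stats(obs, sim):
    sigma_o, sigma_s = obs.std(), sim.std()      # spreads (radial distance)
    r = np.corrcoef(obs, sim)[0, 1]              # correlation (angular position)
    # Centered RMS difference: means are removed before differencing,
    # so this measures pattern error independent of overall bias.
    crmsd = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2))
    return sigma_o, sigma_s, r, crmsd

obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
sim = np.array([1.2, 2.5, 2.2, 4.6, 4.4])
sigma_o, sigma_s, r, crmsd = taylor_stats(obs, sim)

# Law of cosines: crmsd^2 = sigma_o^2 + sigma_s^2 - 2*sigma_o*sigma_s*r
assert abs(crmsd**2 - (sigma_o**2 + sigma_s**2 - 2*sigma_o*sigma_s*r)) < 1e-9
```

This identity is why a single point on the diagram simultaneously shows all three quantities: fixing any two determines the third.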
Heating in traditional buildings releases CO2, which adds greatly to the overall greenhouse gases produced worldwide, especially where these buildings use natural gas for their heating systems. The factor of 0.025 kg CO2 per energy unit consumed is vital for establishing the carbon footprint of a building. Total annual emissions are usually considerable, since energy used for heating, both space and water, can exceed half of a building's total energy consumption. The actual emissions data, as illustrated in the graph, show a wide variation from about 0 to 7 kg, probably the total emissions computed for different buildings or periods, indicating that gas-fired heating systems have variable but consistently present environmental impacts, as shown in Figure 8.
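Applying the 0.025 kg CO2 per energy-unit factor quoted above is a single multiplication; the consumption value below is an illustrative example chosen to land at the upper end of the 0 to 7 kg range shown in the figure.

```python
# Converting heating energy consumption to CO2 emissions.
EMISSION_FACTOR = 0.025  # kg CO2 per unit of energy consumed (from the text)

def heating_emissions(energy_units: float) -> float:
    """Return the CO2 emissions (kg) for a given heating energy consumption."""
    return energy_units * EMISSION_FACTOR

# An illustrative 280 energy units of gas heating maps to 7.0 kg CO2,
# the top of the 0-7 kg range in the figure.
print(heating_emissions(280))  # 7.0
```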
Examining the performance of the voting, stacking, and blended machine learning models against the same emissions indicates the potential of AI to enhance energy management and emissions forecasting. The “Actual” data column is labeled as the ground truth and shows the real-world variability that the models will have to reproduce. The predictions from the “Voting” model (blue) seem to follow the general shape of the actual data but noticeably deviate from the actual values for several data points. This suggests that the “Voting” model is averaging the predictions of its base learners with moderate success but does not possess the accuracy to meet all cases, especially those with higher emissions.
Conversely, the predictions made by the “Stacking” ensemble (green) exhibit a somewhat different behavior. The predictions generally appear to cluster together more tightly in the mid-range, while significantly underpredicting some of the higher actual values of emissions. This suggests that while the meta-learner from the stacking model is able to better synthesize the patterns than just a simple vote for many of the instances, it may have more difficulty capturing outliers or the upper extreme of the emissions spectrum, which could be a function of smoothing out some of the more extreme peaks of energy use and emissions.
The “Blended” model (red) gives the predictions most visually similar to the actual data across the whole range of emissions. The proximity of so many red dots to the yellow “Actual” points shows that the blending technique has captured the close relationship between the input features and the CO2 output. This implies a more durable and adaptable model that can accurately predict the carbon footprint of gas-heated buildings in most cases, which is important for targeted reduction strategies. Ultimately, precise prediction of CO2 emissions, as demonstrated by the strong performance of the blended model, is the primary requirement for decarbonizing the building sector. Reliable emission forecasts let stakeholders identify inefficient buildings, improve the performance of their heating systems, and verify whether retrofitting or behavioral changes have been effective. The transition from natural gas, with its 0.025 kg CO2 per unit emission factor, to electric heat pumps or renewable energy sources is the only long-term solution; meanwhile, sophisticated forecasting models like these are the main tools for managing and considerably reducing the environmental impact of the current building stock.
The ensemble machine learning models tested here vary in how efficiently they estimate environmental impact from the natural-gas emission analysis of existing buildings. The blended model is the most impressive, coming closest to aligning its predictions with the actual emissions across the ranges examined. Notably, the emission factor employed (0.025 kg CO2 per unit of energy input) demonstrates the carbon-intensive nature of gas heating, which is ultimately what each model seeks to quantify. The voting ensemble, a reasonable benchmark that creates a baseline by averaging its predictions, and the stacking model, which incorporates a meta-learned estimator, are both robust. However, both struggle to fit extreme values reliably, and capturing those values efficiently may require more advanced methods. Each model's predictions, especially the blended model's, advance the capacity to identify and remediate inefficiencies in the existing building stock and to implement a decarbonization strategy, providing significant foresight into a building's environmental impact before a full conversion to renewable energy is adopted.
The visualization contains two line graphs projecting energy consumption for heating from 2021 to 2050 with an LSTM (Long Short-Term Memory) model, as shown in Figure 9. The top graph, entitled “Projected Energy Consumption 2021–2050”, presents an absolute forecast that starts from a 2020 baseline of 100% and shows a clear downward trend, with a dramatic reduction in energy consumption over 30 years. The bottom graph shows the annual percentage change in consumption, indicating the volatility and rate of change year after year, with significant deviation before settling.
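The bottom panel's year-over-year percentage change is derivable from the top panel's absolute forecast. The trajectory below is a made-up smooth decline indexed to a 100% 2020 baseline, used only to show the computation; the study's LSTM output would replace it.

```python
# Deriving annual percent change from an absolute forecast series.
import numpy as np

years = np.arange(2021, 2051)
# Hypothetical declining forecast: ~98% of baseline in 2021, decaying smoothly.
forecast = 100.0 * np.exp(-0.017 * (years - 2020))

# Annual percent change: (x_t - x_{t-1}) / x_{t-1} * 100, one value per
# year-over-year transition (so 29 values for 30 forecast years).
pct_change = np.diff(forecast) / forecast[:-1] * 100.0
```

In the real figure this derived series is what exposes the early volatility: large swings in `pct_change` at the start of the horizon, settling toward a steady rate later.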
One important aspect of such forecasts is the use of a color gradient to portray the passage of time. The line starts at blue for the immediate future (e.g., 2023) and shifts gradually through purple to red for the far-off projections (e.g., 2050). This color scale is very effective in conveying the uncertainty characteristic of forecasting: the longer the time horizon, the less certain the predictions. On the right-hand side, a detailed legend displays each year with its corresponding color, so that the forecast for any particular year can be accurately interpreted.
The story illustrated by these charts is one of a meaningful energy transition. The steady decline in the top chart suggests energy efficiency measures have been implemented successfully, a transition to more efficient heating systems has taken place, or energy sources with better conversion efficiencies have been utilized. The considerable volatility in the annual change chart, particularly at the beginning of this process, suggests a time of rapid adoption of technology and market volatility before the new energy revolution reaches a more stable state of gradual gain towards the 2050 vision.
In complete contrast to the projected decline, a scenario where energy use for heating rose to 135% of the 2020 baseline by 2050 would represent a significant and alarming departure from current trends. A 35% increase above 2020 levels would reflect a complete failure to retrofit buildings for energy efficiency and to decarbonize the heating sector. Possible catalysts for this trajectory include a rapid rise in global energy demand, increasing dependence on fossil fuels, slow retrofitting of the building stock, and the effects of extreme weather, which may raise heating load expectations. In short, such an escalation in energy consumption would fundamentally jeopardize climate goals, energy security, and consumer energy expenditure, underscoring the profound importance of the aggressive efficiency commitments and clean energy transition that the original LSTM modeling outlook demonstrates.