Within this study, we classified the literature into six overarching domains: electricity forecasting, energy forecasting in buildings, solar energy forecasting, wind energy forecasting, thermal and gas energy forecasting, and hybrid or emerging systems such as electric vehicles and energy storage. The studies within each domain were arranged chronologically, enabling readers to follow the progression of methodologies from earlier contributions to more recent advances. This organization provides a comprehensive overview across diverse energy sectors while also facilitating an understanding of the technological trends and research trajectories that have shaped the field over time.
2.4.1. Electricity Forecasting
Electricity consumption forecasting, which plays a fundamental role in ensuring grid stability, optimizing generation dispatch, managing market pricing, and minimizing outages, is one of the most mature and actively researched areas in energy analytics [
52]. Relative to other energy forecasting domains, electricity load prediction benefits from the abundance of historical data and relatively well-structured inputs, e.g., time-of-day indicators and weather variables. Yet, this apparent regularity belies significant challenges, especially in localized or sector-specific contexts, where nonlinearities and unexpected variations are introduced by behavioral, environmental, and technological factors.
Table 2 lists selected studies using R for electricity forecasting. It outlines dataset details, model types, performance results, and main R packages, highlighting recent trends across regions and time scales.
Recently, the R programming environment has been adopted extensively for electricity forecasting tasks across academic and applied settings. Its open-source ecosystem supports a wide range of modeling strategies, including classical time series methods, e.g., autoregressive integrated moving average (ARIMA) and exponential smoothing state space (ETS); tree-based learners, e.g., the random forest (RF) and gradient boosting methods; and advanced hybrid models that combine statistical baselines with ML or neural networks (NNs). Importantly, R’s support for hierarchical time series modeling, probabilistic forecasting, and explainable artificial intelligence (XAI) techniques closely aligns with the evolving demands of modern electricity systems.
During 2020, Gontijo and Costa [
53] employed hierarchical time series models in R to analyze Brazil’s hourly electricity generation across regions and energy sources. The dataset spanned wind, hydro, thermal, solar, and nuclear generation from 2018 to 2020, and forecasting was performed using the ARIMA and ETS models via the hts package. The compared reconciliation strategies included bottom-up, top-down, and minimum trace (MinT) approaches. MinT exhibited the highest accuracy with the lowest MAPE and RMSE values, demonstrating the benefits of optimal aggregation in national energy systems.
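A minimal sketch of this type of hierarchical reconciliation workflow in R is shown below. The two-region/two-source structure, the series names, and the simulated counts are assumptions for illustration only and do not reproduce the authors’ data or configuration.

```r
# Minimal sketch of hierarchical forecasting with MinT reconciliation using the
# hts package. The structure and simulated data are illustrative assumptions.
library(hts)

set.seed(1)
# Bottom-level hourly generation series: region (N/S) x source (wind/hydr)
bottom <- ts(matrix(rpois(4 * 24 * 30, lambda = 50), ncol = 4), frequency = 24)
colnames(bottom) <- c("Nwind", "Nhydr", "Swind", "Shydr")

# Build the hierarchy: first character = region, remaining four = source
gen_hts <- hts(bottom, characters = c(1, 4))

# Base forecasts reconciled with bottom-up and MinT (shrinkage) strategies
fc_bu   <- forecast(gen_hts, h = 24, method = "bu", fmethod = "arima")
fc_mint <- forecast(gen_hts, h = 24, method = "comb", weights = "mint",
                    covariance = "shr", fmethod = "arima")

# Inspect the reconciled total (top-level) forecasts from both strategies
head(aggts(fc_bu, levels = 0))
head(aggts(fc_mint, levels = 0))
```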
In 2021, Zhang and Li [
54] proposed a closed-loop clustering (CLC) framework for hierarchical load forecasting, in which clustering was iteratively linked with model fitness. Using real smart meter datasets from Ireland (January–November 2012) and incorporating South Wales PV and meteorological data, they implemented multiple linear regression (MLR) via the lm function in RStudio across all methods. Their findings demonstrated that the CLC framework consistently outperformed conventional approaches—including top-down, Gaussian mixture models (GMMs), ensemble, and bottom-up strategies—achieving MAPE reductions of 19.90%, 18.40%, 26.89%, and 52.20%, respectively.
Allee et al. [
55] used survey and smart meter data from 1378 customers across 14 rural Tanzanian mini-grids to predict initial electricity demand during the first year after connection. They trained least absolute shrinkage and selection operator (LASSO) and RF models and benchmarked them against an intercept-only baseline, evaluating performance through leave-one-group-out cross-validation at the site level. The best site-level result was obtained with LASSO (median absolute percent error of 37%), followed by RF (45%) and the intercept-only baseline (62%). At the customer level, the reported median absolute error was 66%. All analyses were conducted in R (RStudio) using ggplot2, dplyr, glmnet, randomForest, and Boruta for feature screening.
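The following sketch illustrates, under stated assumptions, how LASSO with leave-one-group-out (site-level) cross-validation can be assembled with glmnet; the data frame, column names, and error summary are hypothetical and only approximate the workflow described above.

```r
# Sketch of LASSO with leave-one-group-out (site-level) cross-validation.
# The data frame and columns are hypothetical stand-ins for survey features.
library(glmnet)

set.seed(42)
demand <- data.frame(site       = rep(paste0("site", 1:14), each = 50),
                     income     = rnorm(700, 3, 1),
                     appliances = rpois(700, 3),
                     tariff     = runif(700, 0.2, 0.5))
demand$kwh_month <- with(demand, 2 + 1.5 * appliances + 4 * income + rnorm(700))

logo_errors <- sapply(unique(demand$site), function(s) {
  train <- demand[demand$site != s, ]
  test  <- demand[demand$site == s, ]
  x_tr  <- as.matrix(train[, c("income", "appliances", "tariff")])
  x_te  <- as.matrix(test[,  c("income", "appliances", "tariff")])
  fit   <- cv.glmnet(x_tr, train$kwh_month, alpha = 1)       # LASSO
  pred  <- predict(fit, x_te, s = "lambda.min")
  median(abs(pred - test$kwh_month) / test$kwh_month) * 100  # median APE (%)
})
round(logo_errors, 1)
```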
By 2022, Silva et al. [
56] conducted a comparative study of statistical and NN models for monthly electricity consumption in the Brazilian industrial sector using data from the Central Bank of Brazil. The series was split into January 1979–December 2018 for model fitting and January 2019–December 2020 for forecasting. The authors evaluated Holt–Winters, seasonal ARIMA (SARIMA), dynamic linear model (DLM), and TBATS (short for Trigonometric seasonality, Box–Cox transformation, ARMA errors, Trend, and Seasonal components), alongside neural network autoregression (NNAR) and multilayer perceptron (MLP). All statistical analyses and graphics were performed in R, with implementations referencing the forecast package for time series models, including TBATS, and the dlm package for the DLM. Across horizons, the MLP achieved the best overall performance, recording MAPE values of 1.48 (fit) and 3.41 (forecast) and outperforming the remaining candidates on average.
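A compact sketch of this kind of model comparison with the forecast package is given below; the built-in AirPassengers series stands in for the industrial consumption data, so the models and accuracy call illustrate the workflow rather than the reported results.

```r
# Sketch comparing statistical and neural time series models with the forecast
# package; AirPassengers is used as a stand-in for the monthly consumption data.
library(forecast)

y_train <- window(AirPassengers, end = c(1958, 12))
y_test  <- window(AirPassengers, start = c(1959, 1))
h       <- length(y_test)

fits <- list(hw     = hw(y_train, h = h),                    # Holt-Winters
             sarima = forecast(auto.arima(y_train), h = h),  # seasonal ARIMA
             tbats  = forecast(tbats(y_train), h = h),
             nnar   = forecast(nnetar(y_train), h = h))      # NN autoregression

# Out-of-sample MAPE for each candidate
sapply(fits, function(f) accuracy(f, y_test)["Test set", "MAPE"])
```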
Subsequently, in 2023, Zhou et al. [
57] proposed D-MCQRNN (short for deep learning-based monotone composite quantile regression neural network), a deep learning (DL) architecture that combines dropout and DropConnect regularization in a monotone composite quantile regression NN. Here, the model was trained on hourly electricity load and meteorological data from Henan Province, China, and it was evaluated in terms of the continuous ranked probability score (CRPS), coverage ratio, and volumetric efficiency. The architecture was implemented using the keras and tensorflow packages in R and outperformed several benchmarks, including quantile RF (QRF), Bayesian NN, and least-squares support vector machine (SVM) models. The D-MCQRNN model obtained up to 26.76% improvement in the CRPS.
Later, in 2024, Cabreira et al. [
58] employed a hierarchical structure comprising national, regional, and state levels to forecast monthly electricity consumption in Brazil’s industrial sector. Their dataset, which was obtained from Brazil’s Energy Research Company (EPE), spanned from 2004 to 2022, offering long-term, disaggregated consumption records. Moreover, the forecasting models included ETS, ARIMA, feedforward NNs (FFNNs), and long short-term memory (LSTM) networks. The bottom-up LSTM model yielded the best performance among the hierarchical reconciliation strategies, which consisted of bottom-up, top-down, and optimal combination methods, achieving an MAPE of 2.35% and an RMSE of 433.36 GWh. The analysis employed several R packages, including the hts, fpp3, and forecast packages, and highlighted the effectiveness of disaggregated modeling in capturing localized consumption dynamics.
Ribeiro et al. [
59] proposed the unit Burr XII quantile ARMA (UBXII-ARMA) model to forecast proportions bounded between 0 and 1, e.g., stored hydroelectric energy. The model, which was developed in R, combines quantile regression with the autoregressive moving average (ARMA) structure, a logit link, and harmonic seasonal covariates. Validated on Brazilian hydropower data (2000–2019), the UBXII-ARMA model outperformed the beta-ARMA and Kumaraswamy ARMA (KARMA) models in terms of the MAPE while also detecting structural breaks that other models overlooked.
Abbasabadi and Ashayeri [
60] proposed the TwEn ML framework to predict hourly electricity demand in urban areas using social media activity. The dataset used in the study included geotagged tweets aligned with the New York Independent System Operator (NYISO)’s electricity demand data for 2021. The features included tweet frequency per borough and hour, which was extracted using the academictwitteR package. The implemented models included artificial NN (ANN), decision tree (DT), RF, and gradient boosting machine (GBM) models, which were evaluated using the caret, randomForest, and nnet packages. The ANN models yielded R² values between 0.72 and 0.99 depending on the season, revealing strong correlations between digital social rhythms and energy consumption.
He et al. [
61] designed the hybrid MIDAS-NAMEMD-QRNN model for the mixed-frequency probabilistic forecasting of peak electricity demand. In the study, the data were sourced from Vermont and Houston, incorporating hourly load, daily weather, and policy variables. The R implementation used mixed data sampling (MIDAS) for frequency alignment, noise-assisted multivariate empirical mode decomposition (NAMEMD) for signal decomposition, and a quantile regression NN (QRNN) for quantile estimation. Kernel density estimation (KDE) translated the outputs into full probability distributions. The model achieved MAPE values of 1.19% and 1.86% for Vermont and Houston, respectively, as well as CRPS reductions relative to nine baseline models.
Mutombo et al. [
62] forecast the total electricity generation in South Africa using multidecade data from the International Energy Agency covering 1990–2020. The energy sources included coal, nuclear, hydro, biofuels, oil, wind, solar PV, and solar thermal. In addition, multiple regression models were tested using R, with model m06 (comprising coal, nuclear, and solar PV) achieving the best performance (R² = 0.9988, RMSE = 807.66). The lm function and the ggplot2 package were used alongside diagnostics for multicollinearity, heteroscedasticity, and outliers.
Most recently, in 2025, Zournatzidou [
63] modeled renewable energy consumption using six R-based ML models, i.e., the RF, support vector regression (SVR), extreme gradient boosting (XGBoost), light GBM (LightGBM), LASSO, and MLP models. In addition, a novel predictor, the energy uncertainty index, was introduced. The LightGBM model produced the most accurate 6-month forecast (MAPE = 1.15%), followed by XGBoost. Rolling-window cross-validation and log-differencing ensured stationarity and model robustness, demonstrating the effectiveness of boosting techniques in volatile energy markets.
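The sketch below shows one way to set up rolling-window (time-slice) cross-validation for a boosting model in caret; the simulated monthly predictors, the window sizes, and the xgbTree learner are illustrative assumptions rather than the study’s exact configuration.

```r
# Sketch of rolling-window (time-slice) cross-validation with caret for a
# boosting model on a simulated monthly energy series.
library(caret)

set.seed(7)
n  <- 120                                   # ten years of monthly observations
df <- data.frame(lag1 = rnorm(n), lag2 = rnorm(n), uncertainty = rnorm(n))
df$consumption <- 0.6 * df$lag1 + 0.3 * df$lag2 + 0.2 * df$uncertainty +
  rnorm(n, sd = 0.1)

ctrl <- trainControl(method        = "timeslice",
                     initialWindow = 60,    # train on five years
                     horizon       = 6,     # evaluate six months ahead
                     fixedWindow   = TRUE)

fit_xgb <- train(consumption ~ ., data = df, method = "xgbTree",
                 trControl = ctrl, tuneLength = 2)
fit_xgb$results[which.min(fit_xgb$results$RMSE), c("RMSE", "MAE")]
```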
Keka and Çiço [
64] compared linear and nonlinear approaches for electric load analysis using R. They assembled 15-min load data and 30-min weather data from three substations of an anonymized utility (Electrical Company Z, Region X) and aggregated them to 30-min intervals covering July 2009 to July 2014. The models included linear regression, MLR, polynomial regressions of degrees 2–4, and an interaction specification. Drawing on analysis of variance (ANOVA), Akaike information criterion (AIC), and Bayesian information criterion (BIC) comparisons, the degree-4 polynomial provided the lowest AIC/BIC values, with an adjusted R-squared of 0.07619 and an F-statistic of 279. The implementation was conducted in R (e.g., stats::lm()), with standard visualizations used to examine distributional patterns.
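A short sketch of this polynomial model-selection step, assuming a simulated load–temperature data set, is shown below; it compares candidate degrees with AIC, BIC, and a nested ANOVA using base R’s lm().

```r
# Sketch of comparing polynomial load-temperature models by AIC/BIC; the data
# frame is simulated and only mirrors the model-selection step described above.
set.seed(3)
load_df <- data.frame(temp = runif(5000, -5, 35))
load_df$load <- 120 + 0.8 * (load_df$temp - 18)^2 + rnorm(5000, sd = 40)

m_lin  <- lm(load ~ temp, data = load_df)
m_poly <- lapply(2:4, function(d) lm(load ~ poly(temp, d), data = load_df))

# Information criteria for all candidates; lower is better
data.frame(model = c("linear", paste0("poly", 2:4)),
           AIC   = sapply(c(list(m_lin), m_poly), AIC),
           BIC   = sapply(c(list(m_lin), m_poly), BIC))

# Nested-model comparison between degree-3 and degree-4 specifications
anova(m_poly[[2]], m_poly[[3]])
```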
Short-term load forecasting (STLF) models are prone to overfitting when high-frequency sensor data and numerous engineered covariates are added without controls for parsimony [
65]. Heterogeneous data sources—such as supervisory control and data acquisition (SCADA), surveys, and social media activity—can introduce target leakage and misaligned timestamps, inflating reported accuracy; strict temporal separation and careful lag design are therefore essential [
66]. In multi-level planning, forecasts developed independently at feeder, regional, and national scales may not aggregate consistently. When a hierarchy exists, reconciliation methods such as the MinT framework offer a practical solution [
67]. For sites with limited local history, such as EV charging stations or new buildings, generalization often suffers; transfer learning and low-data strategies can mitigate this limitation [
These patterns recur across the reviewed electricity studies, which frequently address them by combining ensemble methods with hierarchical forecasting tools in R.
2.4.2. Energy Forecasting in Buildings
Forecasting energy consumption in buildings is a crucial yet complex task within the energy analytics landscape. Unlike grid-level electricity demand, which is influenced by large-scale consumption trends and macroeconomic factors, building-level forecasting must capture localized dynamics, e.g., occupancy behavior, heating, ventilation, and air conditioning (HVAC) operation, thermal inertia, equipment schedules, and microclimatic variation. The diversity of building types, ranging from residential apartments and offices to commercial hotels and healthcare facilities, adds a layer of complexity since each class exhibits distinct temporal and structural energy patterns.
The growing availability of high-resolution building data, frequently derived from smart meters and environmental sensors, has expanded the possibilities for developing granular forecasting models, and this trend aligns well with the capabilities of the R programming environment, which offers extensive support for time series modeling, statistical learning, and model interpretability. R has been employed across a wide spectrum of applications, including STLF, daily peak demand estimation, and integrated modeling of internal conditions, e.g., temperature, humidity, and lighting levels. Moreover, energy forecasting in buildings frequently incorporates transfer-learning strategies, hybrid ensemble frameworks, and online learning algorithms, each of which benefits from R’s flexibility and modularity.
Table 3 provides a summary of representative studies that applied R for forecasting energy use in buildings, based on the information available in the original papers.
During 2020, Shen et al. [
69] investigated how psychological and behavioral factors affected household electricity usage through a behaviorally informed SVR model. In this case, data were collected via surveys and smart meters in Hangzhou, China, including 48 variables ranging from appliance ownership and occupancy patterns to personality traits based on the Big Five model. Using AIC, 18 predictors were retained. The SVR model with a radial basis function kernel was optimized using a genetic algorithm (GA) in R, achieving an adjusted R² value of 0.6857. Implemented with the GA, e1071, and custom SVR packages, the model forecasted household consumption and simulated reductions under targeted interventions, exhibiting up to 12.1% energy savings.
Dominguez-Jimenez et al. [
70] examined seasonality in EV charging demand using aggregated load data from 44 public charging stations in Boulder, Colorado, spanning 1 January 2018 to 28 February 2019. Preprocessing, modeling, and evaluation were carried out in RStudio (Version 1.1.442). For regression analysis, RF and a quasi-Poisson generalized linear model (GLM) were compared; RF generally provided higher accuracy, although quasi-Poisson captured large demand variations more effectively. For classification, twelve algorithms were tested, and GLMNET (short for LASSO and elastic-net regularized GLMs) was ultimately identified as the best-performing model. The abstract reported test accuracy as high as 100%, while the main results indicated an average test accuracy of approximately 98.78%. Cross-validation summaries also highlighted XGBoost as achieving the highest mean accuracy and AUC among the classifiers considered.
Additionally, Zor et al. [
71] evaluated two symbolic regression models, i.e., gene expression programming (GEP) and group method of data handling (GMDH), to forecast short-term electricity consumption in a large hospital in the Eastern Mediterranean, Turkey. Both models were constructed in R without prior feature selection using one year of data, including meteorological and calendar variables. The GEP model generated an interpretable equation using four inputs, and the GMDH model employed seven variables and higher-order polynomials, achieving slightly better accuracy (MAPE = 0.620% vs. 0.641%). In addition, a temporal error analysis revealed challenges during holidays and seasonal changes, emphasizing the need for flexible modeling approaches.
Moon et al. [
72] addressed the cold-start problem in STLF, where a lack of historical data makes accurate prediction difficult. They introduced SPROUT (short for solving the cold-start problem in STLF using tree-based methods), which is a hybrid model implemented in R that combines two RF approaches, i.e., one trained on only 24 h of data from the target building and another trained on data from 14 other buildings. Beyond this, a transfer-learning mechanism was employed to identify the most similar building based on the load patterns using the Euclidean distance. The final forecast combined the outputs from both models and adjusted for different calendar variables, e.g., weekdays and holidays. SPROUT significantly outperformed 14 benchmark models in multistep hourly forecasts, exhibiting improvements in terms of the MAPE, RMSE, and mean absolute error (MAE) metrics.
Fan et al. [
73] proposed a transfer-learning methodology for 24 h-ahead building energy prediction using the Building Data Genome dataset (507 nonresidential buildings, primarily in America and Europe), with one year of hourly data per building. Of these, 407 buildings formed the source domain and 100 the target domain (83,060 vs. 19,027 samples after windowing). All preprocessing and modeling were conducted in R using the keras package. The pre-trained network stacked two one-dimensional convolutional layers (kernel = 4; filters = 200 and 100) with a bidirectional LSTM and embeddings for categorical inputs, outputting the next 24 h load. In the source domain, the pre-trained model achieved an RMSE of 10.97 kW, compared with 16.58 kW and 14.39 kW for previous-day and previous-week benchmarks, respectively. For transfer to target buildings, two strategies were tested: feature extraction (Model A) and weight initialization (Model B). Under Scenario A (random 20–80% of a year), mean performance improvement ratios (PIRs) reached 0.490–0.483 at 20%, with benefits diminishing as data volume increased. Under Scenario B (2–10 months), mean PIRs ranged from 0.729 to 0.779 for feature extraction and 0.676 to 0.752 for weight initialization, indicating an average error reduction of at least 67%.
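A minimal R keras sketch of a comparable convolutional–recurrent architecture is given below. The 168 h input window, the LSTM width, and the omission of the categorical embedding branch are simplifying assumptions; the fit() call is left commented out because no training data are defined here.

```r
# Sketch of a 1D convolutional + bidirectional LSTM forecaster in R keras,
# loosely following the architecture described above (not the authors' code).
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 200, kernel_size = 4, activation = "relu",
                input_shape = c(168, 1)) %>%        # one week of hourly load
  layer_conv_1d(filters = 100, kernel_size = 4, activation = "relu") %>%
  bidirectional(layer_lstm(units = 64)) %>%
  layer_dense(units = 24)                           # next 24 hourly loads

model %>% compile(optimizer = "adam", loss = "mse")
summary(model)

# model %>% fit(x_train, y_train, epochs = 30, batch_size = 128,
#               validation_split = 0.1)             # placeholder training call
```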
In 2021, Wenninger and Wiethe [
74] benchmarked five R-based ML models, i.e., ANN, SVR, RF, XGBoost, and D-vine copula quantile regression (QR) models, for residential heating energy prediction in Germany. All models were trained on data from 25,000 residential buildings and validated against 345 energy performance certificate (EPC)-certified samples. They found that all models outperformed the engineering-based EPC benchmark. XGBoost obtained the best performance with a coefficient of variation of 0.329, compared to 0.614 for the EPC model. Importantly, feature importance analysis identified living area, insulation, and energy source as primary predictors.
By 2022, Fan et al. [
75] introduced a conditional variational autoencoder (CVAE)-based data augmentation method to improve STLF in buildings with limited data. Implemented in R, the CVAE model employed 7-day historical consumption and seasonal variables to generate synthetic training data. In this study, two CVAE architectures were tested, one with fully connected layers and another with one-dimensional convolutional layers. Moreover, enhanced artificial NNs (ANNs) incorporating one-dimensional convolutional NN (CNN) and bidirectional LSTM (BiLSTM) layers were trained on data from 52 buildings in the Building Data Genome Project with and without augmentation. The augmented models showed a 12–18% improvement in coefficient of variation of the root mean square error (CVRMSE), outperforming traditional methods, e.g., Gaussian noise injection. The study demonstrated R’s utility in data preprocessing, generative modeling, and model evaluation.
Jozi et al. [
76] proposed a contextual learning framework implemented in R to enhance short-term predictions of energy consumption, generation, room occupancy, brightness, and temperature in smart buildings. This method clusters historical data using k-means based on contextual features, e.g., ambient temperature, lighting, and occupancy. Within each cluster, various models, including SVM, hybrid neural fuzzy inference system (HyFIS), Wang–Mendel, and GFS.FR.MOGUL (short for genetic fuzzy system for fuzzy rule learning based on the MOGUL methodology) models, were trained. The system was deployed in real time at the Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development (GECAD) Building N in Portugal and integrated 15-min sensor data into a building energy management system. It improved the forecast accuracy compared with LSTM baselines and enabled automated control suggestions for HVAC and lighting systems.
Kim et al. [
77] developed a multi-level stacked regression framework to predict 15-min-ahead electricity consumption in a steel hot rolling mill (HRM). At Level 0, the framework incorporated linear regression, RF, GBM, and SVM, whose out-of-fold predictions were subsequently combined via ridge regression at Levels 1 and 2. ARIMA and ARIMA with exogenous variables (ARIMAX) models served as benchmarks. The study was conducted in R 3.6.3 and RStudio 1.3, using plant operation records aggregated into 15-min intervals from 1 January 2019 to 30 January 2020. Across multiple time-ordered train/test splits, the stacked ridge meta-model (RG2) demonstrated the highest predictive accuracy, achieving an MAPE of approximately 7.4–8.2% for +15-min forecasts, and outperforming both manual operator estimates and individual baseline models.
Moon et al. [
78] developed the ranger-based online learning approach (RABOLA) model for robust STLF in buildings with dynamic consumption patterns. In the first stage, they trained several ensemble models, i.e., RF, GBM, and XGBoost, on historical data. Their predictions, along with time-based (hour, weekday, and holiday) and environmental (temperature and humidity) features, were then used as inputs in the second stage, where an online RF model was implemented using the ranger package in R, updated via a 7-day sliding window. Using hourly data from two office buildings in Washington, USA, RABOLA achieved strong performance (MAPE = 11.03%, CVRMSE = 18.68%), outperforming DL models, e.g., gated recurrent unit and attention-based LSTM models. Feature importance and partial dependence plots (PDPs) enhanced interpretability.
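The sketch below outlines a sliding-window online random forest with ranger, assuming simulated hourly data and omitting the first-stage ensemble features; the 7-day window and daily refit schedule mirror the general idea rather than the exact RABOLA implementation.

```r
# Sketch of a sliding-window online random forest with ranger. The simulated
# data and the 7-day window / daily refit schedule are illustrative assumptions.
library(ranger)

set.seed(11)
n  <- 24 * 60
df <- data.frame(hour = rep(0:23, length.out = n),
                 temp = 20 + 8 * sin(seq_len(n) / 24) + rnorm(n))
df$load <- 150 + 5 * df$temp + 10 * (df$hour %in% 9:17) + rnorm(n, sd = 5)

window <- 24 * 7                                 # retrain on the most recent week
preds  <- rep(NA_real_, n)
for (t in seq(window + 1, n, by = 24)) {         # refit once per day
  train_idx <- (t - window):(t - 1)
  fit <- ranger(load ~ hour + temp, data = df[train_idx, ], num.trees = 300)
  horizon <- t:min(t + 23, n)
  preds[horizon] <- predict(fit, df[horizon, ])$predictions
}
mean(abs(preds - df$load) / df$load * 100, na.rm = TRUE)   # rough MAPE
```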
Subsequently, in 2023, Kondi-Akara et al. [
79] analyzed daily per-capita electricity consumption for twelve West and Central African cities using a nonstationary MLR framework in which both the base load and weather sensitivities varied over time. Key predictors included cooling degree days, a humidity index, and wind speed. Their results indicated that temperature explained approximately 25–70% of the day-to-day variability in electricity demand, with each 1 °C increase associated with a 3–4% rise in base consumption in coastal cities such as Mindelo and Dakar, and a 6–10% rise in most Sahelian and tropical cities. Importantly, humidity effects reached up to 70% of the temperature effect in several Sahelian and tropical locations, underscoring the importance of nonstationary demand dynamics.
Later, in 2024, Zhang et al. [
80] developed a forecasting framework for high-rise hotel energy use in Guangzhou, China, by combining EnergyPlus simulation outputs with ML. They defined six prototype hotel configurations based on 78 architectural layouts, simulated 5000 buildings using Latin hypercube sampling, and identified key predictors through standardized regression coefficients. In this study, 15 R-based models were evaluated, with quadratic polynomial regression demonstrating the highest accuracy and stability (R² > 0.95, error range = 2.38–8.11%). All analyses were conducted in R, demonstrating its effectiveness for high-resolution forecasting in the hospitality sector.
Most recently, in 2025, Cebeci and Zor [
81] examined the impact of COVID-19 on electricity demand at a university hospital using deep polynomial NNs (DPNNs) and GEP. Their dataset included 15-min electricity usage, meteorological variables, and pandemic indicators, e.g., case counts and restriction levels. The DPNN model achieved superior normalized RMSE (5.66% hour-ahead and 11.13% day-ahead) and was five times faster than the GEP model. It is worth noting that both models were developed in R, highlighting the importance of integrating public health variables in high-resolution energy forecasting.
Timur and Üstünel [
82] considered hourly electricity consumption forecasting for a 24/7 industrial facility in Adana, Turkey. Their dataset integrated over 30,000 observations of SCADA-based energy readings and Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2) meteorological variables, e.g., temperature and humidity. Additionally, they tested five forecasting models, i.e., MLR, GMDH, MLP, gradient-boosted decision tree (GBDT), and GEP models. Preprocessing and modeling were performed in R using the dplyr and caret packages, as well as custom scripts. The GBDT model obtained the most accurate results with an MAPE of 0.827%, indicating its strength in high-dimensional, nonlinear applications.
At the building scale, models often overfit when numerous occupancy, HVAC, and microclimate variables are applied to short historical records [
83]. Regime shifts—such as holidays, retrofits, or maintenance—disrupt stationarity and degrade performance; online (RABOLA) and transfer-learning (SPROUT) approaches consistently stabilize forecasts under such conditions [
84]. Small-sample facilities face additional challenges from data sparsity and missing values; data augmentation methods such as CVAE-based generation and robust imputation provide interim solutions until sufficient data are collected [
85]. Where simulation tools are available, integrating physics-based models (e.g., EnergyPlus-based hybrid forecasting) with ML enhances extrapolation and reduces the brittleness of purely statistical approaches [
86]. Equally important, contextual and smart-building frameworks and pandemic-sensitive modeling highlight the need to adapt to behavioral and external disruptions. Taken together, these strategies improve forecast stability without compromising interpretability in the building-scale studies reviewed here.
2.4.3. Solar Energy Forecasting
Although benefiting from relatively regular diurnal and seasonal cycles, solar energy forecasting presents its own set of challenges due to the influence of weather variability, cloud cover, and the spatial heterogeneity of irradiance levels. The integration of distributed solar PVs into modern power systems requires forecasts that are both accurate and adaptable across multiple scales, from rooftop systems to utility-scale solar farms. In contrast to demand-side energy forecasting, which is frequently dependent on behavioral and economic signals, solar forecasting is intrinsically dependent on environmental and atmospheric conditions, thus rendering meteorological modeling an essential component of the task.
Against this backdrop, the reviewed studies demonstrate how R has been employed to develop diverse solar forecasting models, including ML-based approaches to predict solar irradiance and PV output, time series techniques for clear-sky index (CSI) forecasting, and postprocessing frameworks for probabilistic output generation. Some studies incorporated satellite-derived or ground-based weather observations, and others utilized clustering methods or ensemble learners to enhance predictive stability under varying sky conditions. This methodological diversity reflects the multifaceted nature of solar forecasting and highlights R’s adaptability in supporting the development, evaluation, and deployment of exploratory models.
Table 4 presents key R-based approaches for forecasting solar energy across various studies.
During 2020, Yagli et al. [
87] evaluated the feasibility of using bias-corrected satellite-derived global horizontal irradiance (GHI) as the sole input for univariate ML forecasts. Using data from 15 Baseline Surface Radiation Network (BSRN) stations, several models, including the Cubist, generalized linear model via penalized maximum likelihood, SVM, RF, and projection pursuit regression models, were trained using R’s caret package. Moreover, kernel conditional density correction enabled comparable accuracy between satellite-based and ground-based forecasts, supporting scalable approaches in sensor-limited regions.
Yagli et al. [
88] developed a probabilistic forecasting framework that combined 20 ML and time series models to predict the CSI. Implemented using data from the U.S. SURFRAD (short for Surface Radiation Budget) network, the forecasts were postprocessed via generalized additive models (GAMs) for location, scale, and shape (GAMLSS) and QRFs. GAMLSS allowed for detailed modeling of the distributional properties, and the QRF offered nonparametric flexibility. Both methods improved the CRPS considerably, with a skill score gain of up to 58.12% over climatological baselines.
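A brief sketch of probabilistic post-processing with gamlss and quantregForest is given below; the simulated clear-sky-index data, the normal (NO) family, and the chosen quantile levels are assumptions for illustration only.

```r
# Sketch of probabilistic post-processing of clear-sky-index forecasts with
# GAMLSS and a quantile regression forest; data and settings are illustrative.
library(gamlss)
library(quantregForest)

set.seed(21)
n   <- 2000
dat <- data.frame(csi_nwp = runif(n, 0, 1.1),     # raw deterministic forecast
                  zenith  = runif(n, 10, 85))
dat$csi_obs <- pmin(1.2, pmax(0, 0.9 * dat$csi_nwp + rnorm(n, sd = 0.08)))

# GAMLSS: model both the mean and the spread of the observed clear-sky index
fit_gamlss <- gamlss(csi_obs ~ pb(csi_nwp) + pb(zenith),
                     sigma.formula = ~ pb(csi_nwp),
                     family = NO, data = dat)

# Quantile regression forest: nonparametric predictive quantiles
fit_qrf <- quantregForest(x = dat[, c("csi_nwp", "zenith")], y = dat$csi_obs)
q_hat   <- predict(fit_qrf, dat[, c("csi_nwp", "zenith")],
                   what = c(0.05, 0.5, 0.95))
head(q_hat)
```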
Dobreva et al. [
89] presented a comprehensive evaluation framework for PV performance forecast models that combined graphical residual analysis with two scale-independent metrics (s and mm). Using monthly outputs from two grid-connected systems in Namibia—NamPower 1 (p-Si, 63.455 kW) and Güldenboden (CdTe, 15.660 kW)—they developed four PVSYST variants (NP1X, NP1Y, GnX, and GnY) and assessed them in RStudio. Across both qualitative and quantitative comparisons, NP1Y (with shading correction) delivered the best performance, while GnX was the least accurate. For example, NP1Y achieved s = 0.467 and mm = 0.957 under the proposed evaluation metrics. The dataset consisted of AC energy records extracted from the SMA (short for System-, Mess- and Anlagentechnik) Sunny Portal, covering September 2012–July 2017 (59 months) for NamPower 1 and August 2014–December 2018 for Güldenboden, with the 2018 records flagged, leaving 41 reliable months.
In 2021, de Freitas Viscondi and Alves-Souza [
90] compared SVM, artificial NNs (ANNs), and extreme learning machines (ELMs) for daily solar irradiance prediction in São Paulo, Brazil, using the IAG-USP (short for Institute of Astronomy, Geophysics, and Atmospheric Sciences of the University of São Paulo) meteorological dataset (final modeling set: 19,359 daily records, 1962–2014). Data ingestion, model training, and evaluation were conducted in RStudio with the e1071, neuralnet, and ELMR packages. The dataset was shuffled, normalized, and divided into 15,000 training and 4358 testing instances, and model performance was assessed using MAE, RMSE, and Pearson’s correlation coefficient. Among the parameter-group experiments, SVM produced the lowest RMSE, whereas ELM trained substantially faster. The best SVM configuration (SVM_4) achieved MAE = 2.05 MJ/m² and RMSE = 2.78 MJ/m² with r = 0.89, while the best ELM setup (ELM_4) delivered MAE = 2.35 MJ/m² and RMSE = 3.09 MJ/m², with training approximately 94% faster than SVM.
By 2022, Mukilan et al. [
91] presented a rooftop-level PV forecasting method using restricted Boltzmann machines (RBMs) implemented in R. The study addressed environmental nonstationarity by applying a data pipeline involving cleaning, scaling, and k-fold cross-validation. The RBM models outperformed ANNs and backpropagation NNs in terms of MAPE and F-measure, reaching up to 99% accuracy. While sensitive to initialization, the RBM models demonstrated a strong capacity for capturing nonlinear dependencies.
Most recently, in 2024, Masache et al. [
92] compared decision tree-based approaches—RF and its quantile hybrid QRRF (short for quantile regression RF)—with a quantile generalized additive model (QGAM) for short-term GHI forecasting. The dataset consisted of hourly GHI observations collected between March 2017 and June 2019 at the NUST (short for Namibia University of Science and Technology) radiometric station in Windhoek, Namibia. Missing values were imputed using Hmisc, hierarchical interactions were selected with hermit, and the models were implemented in R 4.3.2 (randomForest, quantregForest, mgcViz; simulations via JWileymisc). In simulation experiments, QGAM outperformed QRRF on pinball loss and mean absolute scaled error (MASE). For the Windhoek case study, QGAM produced the lowest RMSE (21.233) and CRPS (205.39), whereas QRRF achieved the lowest MAE (11.11), MASE (0.1177), and pinball loss (5.555). A Diebold–Mariano test (
p = 0.2943) suggested that the overall predictive accuracy of QGAM and QRRF did not differ significantly.
Forecast accuracy declines under rapid cloud dynamics and in areas without dense ground-sensor coverage [
93]. Two effective strategies include (1) applying weather-regime classification before modeling and (2) using bias-corrected satellite GHI where local measurements are unavailable [
94,
95]. Another persistent drawback is poor uncertainty calibration, where point forecasts conceal operational risk; probabilistic post-processing methods such as GAMLSS and QRF improve CRPS and enhance decision-making utility [
96]. Model choice (e.g., SVM vs. ELM) can also influence robustness when long historical datasets are used. Seasonal imbalances, particularly in winter and autumn, can shift feature importance [
97]; rule-based or hybrid learners and neural approaches for rooftop PV often demonstrate greater robustness across seasons.
2.4.4. Wind Energy Forecasting
Wind energy production forecasting presents distinct challenges due to the inherently volatile and geographically variable nature of wind. For instance, wind speeds are highly sensitive to short-term meteorological fluctuations, terrain complexity, and elevation, unlike solar irradiance, which typically exhibits predictable diurnal and seasonal patterns. This variability reduces the accuracy of the output predictions and introduces operational risks for grid integration and scheduling, particularly in regions with renewable energy portfolios that are heavily reliant on wind power. Wind forecasting models must incorporate both temporal and spatial dynamics across various forecasting horizons to address these challenges.
In this domain, the R programming environment supports a broad spectrum of methods, including classical time series approaches and advanced ML and ensemble techniques. Many previous studies have employed autoregressive models, e.g., the ARIMA and SARIMA models, for short-term forecasting because of their transparency and interpretability. Yet, other studies have employed ML models, e.g., RF and SVM models, to capture the nonlinear relationships and interactions among meteorological variables. Moreover, hybrid or hierarchical structures are being adopted increasingly to support multiscale forecasting and temporal reconciliation.
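To make the contrast concrete, the sketch below fits a classical ARIMA baseline and a random forest with meteorological covariates to a simulated wind speed series; the series, the covariates, and the single 24 h test split are illustrative assumptions.

```r
# Sketch contrasting an ARIMA baseline with a random forest using meteorological
# covariates for short-term wind speed forecasting; data are simulated.
library(forecast)
library(randomForest)

set.seed(5)
n    <- 24 * 60
wind <- ts(pmax(0, 6 + 2 * sin(2 * pi * seq_len(n) / 24) +
                  arima.sim(list(ar = 0.8), n, sd = 1)), frequency = 24)
met  <- data.frame(pressure = rnorm(n, 1013, 5),
                   temp     = 15 + 5 * sin(2 * pi * seq_len(n) / 24) + rnorm(n),
                   lag1     = c(NA, head(as.numeric(wind), -1)))

train <- 1:(n - 24); test <- (n - 23):n

fit_arima <- auto.arima(subset(wind, end = n - 24))
fc_arima  <- forecast(fit_arima, h = 24)

fit_rf  <- randomForest(x = met[train[-1], ], y = as.numeric(wind)[train[-1]])
pred_rf <- predict(fit_rf, met[test, ])

c(arima_rmse = sqrt(mean((fc_arima$mean - as.numeric(wind)[test])^2)),
  rf_rmse    = sqrt(mean((pred_rf - as.numeric(wind)[test])^2)))
```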
Table 5 presents key R-based approaches for forecasting wind energy across various studies.
During 2020, Zhang et al. [
98] investigated the resilience of wind forecasting models against false data injection attacks (FDIAs). Implemented in R and RStudio using Monte Carlo simulations, the study compared three deterministic models, i.e., multiple nonlinear regression (MNR), ANN, and SVM models, and three probabilistic models, i.e., QR, QRNN, and k-nearest neighbors (KNN) with kernel density estimation (KNN-KDE) models, on the GEFCom2014 dataset. The SVM and KNN-KDE models demonstrated the highest resistance against data corruption, maintaining lower RMSE values and quantile loss under increasing attack severity. However, all models collapsed under full-scale FDIAs, highlighting the urgent need for cybersecurity-aware designs in forecasting architectures.
In 2021, Costa et al. [
99] introduced the analog-based dynamic time scan forecasting (DTSF) model for multistep wind speed prediction. The DTSF model matches the current observations with historical sequences using polynomial similarity functions and extrapolates future values accordingly. An ensemble version, referred to as eDTSF, aggregates forecasts from multiple analogs to enhance robustness. Validated on 241,200 wind speed observations from Bahia, Brazil, the DTSF model outperformed 11 benchmark models across several error metrics. Implemented in R via the custom DTScanF package, the study demonstrates R’s ability to support high-resolution, analog-based forecasting frameworks.
Most recently, in 2024, Prieto-Herráez et al. [
100] developed the two-stage ensemble optimization for load operations (EOLO) wind forecasting system for integration into Spain’s electricity market. Developed entirely in R, EOLO first minimizes prediction error using an ensemble of eight ML models. It subsequently adjusts the forecasts to optimize the economic returns. Leveraging meteorological data, market pricing, and historical production records, EOLO achieved an average prediction error of less than 8% and increased profitability by up to 2% in selected wind farms. The system is also scalable and requires no manual configuration when applied to different sites, thereby making it suitable for operational deployment.
English and Abolghasemi [
101] proposed a hierarchical wind forecasting framework that reconciles predictions across temporal resolutions ranging from 5 min to 1 h. Linear regression models were constructed at each temporal level using R’s hts package and reconciled using bottom-up, top-down, and MinT methods. Empirical evaluations confirmed that the integrated hierarchical model outperformed the separately trained approaches for each resolution. These findings underscore the importance of temporal consistency and highlight the strengths of R in hierarchical forecasting applications.
Wind forecasting studies often contend with data sparsity—such as limited mast measurements or coarse reanalysis grids—and terrain-induced nonstationarity, both of which degrade the performance of classical ARIMA and SARIMA baselines [
102]. Under these conditions, analog and ensemble approaches (e.g., DTSF and eDTSF) as well as tree-based learners have shown more robust gains [
103]. Multi-horizon forecasting can introduce temporal inconsistencies when 5-min, 15-min, and hourly models are trained separately; hierarchical reconciliation methods (e.g., MinT) restore cross-resolution coherence [
104]. More recently, a critical issue has emerged in the form of adversarial or corrupted telemetry, such as false data injection attacks (FDIAs), where several models fail catastrophically under severe tampering; robust preprocessing and quantile-based learners have demonstrated stronger resilience in testing [
105].
2.4.5. Thermal and Gas Energy Forecasting
Accurate forecasting of thermal energy demand is a critical task for cities and utilities pursuing decarbonization and efficiency in heating systems. In contrast to electricity forecasting, which frequently targets immediate load fluctuations, the prediction of thermal demand and natural gas consumption must address slower, behavior-driven dynamics that are influenced by various factors, e.g., building insulation, occupancy patterns, climatic conditions, and heating system design. Taken together, these factors interact across seasonal and diurnal cycles, creating modeling challenges that require both physical and data-driven insights.
Recent investigations using the R programming environment have highlighted a wide spectrum of modeling strategies tailored to thermal and gas applications, encompassing linear regression to estimate transmission and infiltration losses, ML approaches (e.g., ANN and SVR models) to capture short-term and seasonal patterns, and hybrid frameworks integrating physically simulated data with empirical modeling techniques. R’s flexibility in supporting diverse analytical pipelines has made it an effective tool for fine-grained, building-level prediction and broader district heating and natural gas load forecasting.
Table 6 presents key R-based approaches for forecasting thermal energy across various studies.
During 2020, Liu et al. [
106] proposed a hybrid prediction model for hourly district heating load based on association rule mining and SVR. In this study, two feature selection techniques were evaluated, i.e., one based on the Spearman correlation coefficient and the other using the Eclat algorithm to extract frequent itemsets, both of which were implemented in RStudio. The Eclat-based SVR (E-SVR) model outperformed its Spearman-based (S-SVR) counterpart, reducing the RMSE value by 28.1% and improving accuracy by 8.2%. The historical water supply temperature was the most significant predictor. The study highlights R’s ability to combine unsupervised feature selection with supervised ML techniques for thermal energy analytics.
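A simplified sketch of this two-step pipeline is shown below, assuming simulated district heating data: candidate predictors are discretized and screened with the Eclat algorithm (arules) before an SVR model (e1071) is fitted on the retained features.

```r
# Sketch of Eclat-based feature screening followed by SVR, loosely following the
# hybrid pipeline above. Discretization scheme and simulated data are assumptions.
library(arules)
library(e1071)

set.seed(9)
n  <- 24 * 200
dh <- data.frame(supply_temp  = 70 + 10 * sin(seq_len(n) / 24) + rnorm(n),
                 outdoor_temp = 5 + 8 * sin(seq_len(n) / (24 * 30)) + rnorm(n),
                 wind         = rexp(n, 1 / 3))
dh$heat_load <- 200 - 6 * dh$outdoor_temp + 2 * dh$supply_temp + rnorm(n, sd = 10)

# Discretize candidate predictors and the target, then mine frequent itemsets
disc <- as.data.frame(lapply(dh, function(v) discretize(v, breaks = 3)))
itemsets <- eclat(as(disc, "transactions"),
                  parameter = list(supp = 0.1, maxlen = 3))
inspect(head(sort(itemsets, by = "support"), 5))   # screen co-occurring features

# Train SVR on the features retained by the screening step (here: all three)
fit_svr <- svm(heat_load ~ supply_temp + outdoor_temp + wind, data = dh,
               kernel = "radial")
sqrt(mean((predict(fit_svr, dh) - dh$heat_load)^2))   # in-sample RMSE
```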
Bujalski and Madejski [
107] forecasted combined heat and power (CHP) heat production in the heating season (November–March) using a GAM with weather inputs (air temperature, solar irradiation, and wind speed) and hour of day, calibrated adaptively with a 12-day moving window. The inclusion of irradiation and wind (Model M1) improved accuracy, particularly in March. Across the season, the average MAPE was below 7%, with monthly MAPE between ~5.2% and 7.5%, and RMSE reductions were observed when solar effects were incorporated.
In 2021, Li and Yao [
108] designed a hybrid framework that integrates EnergyPlus-based physical simulations with ML to predict heating and cooling demand across 442 buildings in Chongqing, China. In the study, energy use intensity values simulated via the Urban Modeling Interface were employed to train 10 ML models, including SVR with linear, polynomial, and Gaussian kernels; RF; XGBoost; ANN; ordinary least squares; ridge regression; LASSO; and elastic net models, all of which were implemented using R’s
caret package. At the individual building level, the polynomial SVR model achieved the highest accuracy, and the Gaussian SVR model performed best in terms of aggregated demand predictions. The data-driven models provided over 1000-fold speed improvements compared with simulation runs, thereby enabling rapid assessment of retrofit scenarios.
Dulce-Chamorro and Martínez-de-Pisón [
109] developed parsimonious predictive models of hospital cooling energy demand for San Pedro Hospital in Logroño, Spain, using building management system (BMS) data. The measurements were aggregated on an hourly basis, missing values were interpolated, thermal power and energy were derived, and the target variable (ENERGYKWHPOST) was smoothed with a Gaussian filter (window = 11). The final feature set included calendar indicators and temperature-related variables. Two modeling stages were carried out: the first used data from January 2017–February 2018 for training and March 2018–February 2019 (even/odd weeks) for validation and testing, while the second followed system optimizations introduced around April 2018. Models were optimized with GAparsimony in R, focusing on SVR with RBF kernel, ANN, and XGBoost. SVR delivered the best single-model performance with three features in the initial round, and an ensemble of SVR, ANN, and XGBoost further improved RMSE compared to the best individual model.
Żymełka and Szega [
110] developed a short-term heat demand prediction model for gas-fired CHP systems equipped with thermal storage. Using a feedforward ANN with five hidden layers, implemented entirely in R, the model used the ambient temperature, wind speed, humidity, and previous 24 h heat demand as inputs. Trained on hourly data from the heating season using a 70–15–15 data split, the ANN significantly outperformed an exponential regression benchmark, achieving RMSE, R², and MAPE values of 7.18, 0.9886, and 3.65%, respectively. The findings of the study highlight the feasibility of R-based ANN models for accurate thermal load forecasting in complex energy infrastructures.
By 2022, Shin and Cho [
111] developed and evaluated machine learning models to predict the coefficient of performance (COP) of an air-cooled heat-pump system using operational data collected in a university laboratory. The input variables included heat-source and load-side inlet and outlet temperatures, heat-pump power, and indoor and outdoor air temperatures, all measured at one-minute intervals. The dataset was divided into 70% training (5124 samples) and 30% testing (2196 samples), and the models were implemented in RStudio 1.2.1335. Four algorithms were compared—ANN, SVM, RF, and KNN. Among them, ANN achieved a mean bias error (MBE) of −3.6 and a CVRMSE of 5.4% (error range: −7.8–9%), thereby meeting ASHRAE (short for American Society of Heating, Refrigerating and Air-Conditioning Engineers) Guideline 14 criteria. Results for SVM, RF, and KNN were also summarized for comparison. Ultimately, the ANN model was deployed in the building automation system (BAS) to enable real-time performance monitoring.
Subsequently, in 2023, Pala [
112] proposed a multihybrid long-term forecasting framework for monthly natural gas consumption in the U.S. vehicle fuel and industrial sectors. Using the forecastHybrid package in R, six models, i.e., auto.arima, nnetar, stlm, thetam, ets, and tbats, were combined under equal-weighted (EW), variable-weighted, and cross-validated weighting (CVW) schemes. The CVW ensemble yielded the best results for the industrial dataset (MAPE = 3.19%), whereas the EW ensemble yielded the best results for the vehicle fuel data (MAPE = 5.40%). The results indicated that the ensembles of statistical and neural models outperformed the standalone DL models, e.g., the MLP and ELM models, particularly for highly seasonal and nonlinear time series.
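A minimal sketch of such a weighted ensemble with the forecastHybrid package is given below; the built-in USAccDeaths series, the MAE error criterion, and the window settings are stand-ins chosen for illustration rather than the study’s gas-consumption data or weighting setup.

```r
# Sketch of equal-weighted vs. cross-validation-weighted ensembles with
# forecastHybrid; the series and tuning settings are illustrative assumptions.
library(forecastHybrid)

y <- USAccDeaths   # stand-in monthly series

# a = auto.arima, e = ets, f = thetam, n = nnetar, s = stlm, t = tbats
fit_cvw <- hybridModel(y, models = "aefnst",
                       weights = "cv.errors", errorMethod = "MAE",
                       windowSize = 48, cvHorizon = 12)
fit_ew  <- hybridModel(y, models = "aefnst", weights = "equal")

fc <- forecast(fit_cvw, h = 12)
plot(fc)
accuracy(fit_cvw)   # in-sample accuracy of the weighted ensemble
```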
Bujalski et al. [
113] modeled day-ahead district-heating load in the off-season (June–August) using a generalized additive mixed model (GAMM)—a GAM extended with an autoregressive error term—and a 14-day sliding training window, incorporating calendar-pattern smooths. This configuration eliminated residual autocorrelation and achieved the highest test accuracy. The best-performing variant (AR order ≈ 1–3) yielded an RMSE of 0.84 MW and an MAPE of 3.26% for an average load of ~21 MW, with monthly MAPE ranging from 2.7% to 3.9% (lowest in July). Ambient temperature contributed minimally during the transitional months, making calendar effects the dominant predictors.
Most recently, in 2025, Tudor et al. [
114] developed an ML-based analytical framework to assess and forecast fossil fuel reliance across the EU-27, modeling the share of fossil fuels in the final energy consumption using an RF regressor. This model incorporated six predictors, i.e., gross domestic product (GDP), population, industrial production, CO₂ emissions, share of renewable energy, and energy intensity, and it was implemented in R using the randomForest, caret, iml, and pdp packages. The performance of the model was evaluated using leave-one-out cross-validation with interpretability enhanced via Shapley additive explanations (SHAP) values and PDPs. The results projected a reduction in fossil fuel share from 1.8% in 2022 to 1.33% by 2030, aligning with the EU Green Deal targets and illustrating the efficacy of interpretable ensemble learning under constrained data conditions.
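The sketch below illustrates an interpretable random forest workflow of this kind with the randomForest, iml, and pdp packages; the simulated country panel and variable names are assumptions that only mirror the predictors listed above.

```r
# Sketch of an interpretable random forest with SHAP-style attributions (iml)
# and partial dependence (pdp); the country panel is simulated.
library(randomForest)
library(iml)
library(pdp)

set.seed(27)
panel <- data.frame(gdp = rnorm(270, 100, 20), population = rnorm(270, 10, 3),
                    industry = rnorm(270, 50, 10), co2 = rnorm(270, 60, 15),
                    res_share = runif(270, 5, 60), intensity = rnorm(270, 3, 1))
panel$fossil_share <- 80 - 0.8 * panel$res_share + 0.2 * panel$co2 +
  rnorm(270, sd = 3)

fit_rf <- randomForest(fossil_share ~ ., data = panel, ntree = 500)

# Shapley attributions for a single prediction
pred_fun  <- Predictor$new(fit_rf, data = panel[, -7], y = panel$fossil_share)
shap_vals <- Shapley$new(pred_fun, x.interest = panel[1, -7])
plot(shap_vals)

# Partial dependence of the predicted fossil fuel share on renewable share
pd <- partial(fit_rf, pred.var = "res_share", train = panel)
plotPartial(pd)
```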
Thermal demand patterns are affected by shoulder-season regime shifts and occupant behavior, which many models fail to capture [
115]. For instance, incorporating irradiation and wind improved accuracy during transitional months, whereas calendar-pattern smooths and autoregressive terms stabilized forecasts in off-season periods. Applying seasonal filtering and change-point detection can improve stability during these transitional periods [
116].
Smart meter data streams are frequently noisy or incomplete, making thorough cleaning and imputation essential before any model comparison [
117]. Hospital-scale studies highlighted the importance of handling missing values, while association-rule mining extracts robust predictors. Combining simulation outputs with ML, as in the frameworks integrating EnergyPlus simulations with data-driven models, enables generalization to climates or retrofit conditions beyond those observed in the training data. Moreover, interpretability can be maintained through association-rule mining, and SHAP/PDP analyses help system operators trust the results [
118].
At the broader district heating network (DHN) level, aggregated models can mask infrastructure heterogeneity; segment-specific approaches revealed fine-grained variation and avoided bias from averaging. Ensemble strategies for long-term gas consumption further illustrate the importance of flexible modeling. These findings confirm that R-based pipelines effectively balance accuracy, stability, and interpretability in thermal and gas demand forecasting [
119].
2.4.6. Hybrid and Emerging Energy Systems
Beyond conventional energy domains, specialized forecasting tasks that fall outside the boundaries of single-source or sector-specific energy modeling have been investigated. Such tasks encompass predicting stored hydropower levels, simulating energy usage during public health crises, forecasting battery behavior in uninterruptible power supply (UPS) systems, predicting EV charging demand, and assessing building-level micro-grid dynamics. Although diverse in scope, these studies share a methodological emphasis on adapting traditional forecasting frameworks to accommodate irregular data structures and domain-specific constraints.
The versatility of the R programming environment and its modular package ecosystem makes it particularly effective for addressing such unconventional forecasting challenges. For instance, whether estimating bounded time series, integrating clustered regressors into ARIMA, modeling behavioral features in electric mobility, or transferring shared structures across charging stations with Gaussian processes, R offers robust support through customizable model architectures, diagnostic tools, and user-friendly visualization. The ability to model novel problem settings flexibly is crucial as energy systems become increasingly interconnected and policy-relevant, particularly for analysts working at the intersection of energy, climate, and society.
Table 7 summarizes representative models and key innovations in hybrid and unconventional energy forecasting studies.
During 2020, Haider et al. [
120] proposed a cluster-assisted ARIMA framework for forecasting battery voltage in data-center UPS systems, in which cluster members were incorporated as external regressors in the ARIMA model. The dataset was obtained from a large-scale social media company in China and contained one year of one-minute measurements (470,226 records) across 40 VRLA batteries. During the observation period, four discharge cycles and three power surges were recorded. Clustering was performed on a monthly basis using k-shape and dynamic time warping (DTW). When inconsistency among cluster members was detected, the relevant members were passed as external regressors (xreg) to ARIMA, implemented with the forecast package, while clustering was executed using dtwclust. Across batteries and performance metrics (RMSE, MAE, MAPE), the k-shape-clustered ARIMA (CK) improved accuracy compared with both single-battery and total predictors, and also outperformed the DTW-clustered ARIMA (CDTW).
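A condensed sketch of the cluster-assisted idea, assuming simulated battery voltage series, is given below: dtwclust performs shape-based clustering, and the cluster peers of one battery are passed to auto.arima as external regressors; the reuse of the last observed peer values as "future" xreg is a placeholder.

```r
# Sketch of cluster-assisted ARIMA: shape-based clustering with dtwclust, then
# cluster peers supplied to auto.arima via xreg. Data are simulated placeholders.
library(dtwclust)
library(forecast)

set.seed(13)
n_obs  <- 500
base_a <- 13.5 + 0.3 * sin(seq_len(n_obs) / 50)
base_b <- 13.2 + 0.3 * cos(seq_len(n_obs) / 20)
volts  <- cbind(replicate(4, base_a + rnorm(n_obs, sd = 0.05)),
                replicate(4, base_b + rnorm(n_obs, sd = 0.05)))
colnames(volts) <- paste0("batt", 1:8)

# k-shape-style clustering of the z-normalized voltage series
clust <- tsclust(t(volts), type = "partitional", k = 2,
                 preproc = zscore, distance = "sbd", centroid = "shape")
peers <- which(clust@cluster == clust@cluster[1])[-1]   # cluster mates of batt1

# ARIMA for battery 1 with its cluster peers supplied through xreg
fit <- auto.arima(volts[, 1], xreg = volts[, peers, drop = FALSE])
fc  <- forecast(fit, h = 10,
                xreg = volts[(n_obs - 9):n_obs, peers, drop = FALSE])
fc$mean
```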
Almaghrebi et al. [
121] focused on session-level electricity demand forecasting for plug-in electric vehicle charging events. Here, the dataset included more than 22,000 real-world charging sessions collected over seven years from public stations in Nebraska. Each session’s input features included charging history statistics, session timing, pricing policy, and user behavior. Subsequently, R was used to implement and compare linear regression, support vector machine (SVM), RF (via ranger), and XGBoost models with tuning via the caret package. The XGBoost model achieved the best RMSE and R² values of 6.68 kWh and 51.9%, which were further improved when outlier sessions were removed. The findings of the study confirmed the predictive value of behavioral features in electric mobility applications.
Vink et al. [
122] analyzed a building-scale micro-grid in Tsukuba, Japan, to assess the one-year forecasting potential of solar PV generation and building electricity demand. They applied linear, nonlinear, and SVR approaches, finding that PV output was linearly related to solar irradiance, whereas purchased electricity (demand) followed a quadratic relation with temperature. For PV, SVR improved RMSE in 2015–2016, whereas linear regression performed better in 2017. For demand, the quadratic model consistently outperformed the linear specification. The combined models estimated electricity costs within approximately 8% of actual values.
In 2021, Gilanifar and Parvania [
123] proposed a clustered multinode learning model with Gaussian processes (GPs) to predict flexible energy demand across multiple EV charging stations. Their dataset, which was gathered from 53 charging stations in Utah, included daily load curves and limited metadata per station. The model employed
k-means clustering to identify similar stations and transferred shared structure via linear predictors while considering residual uncertainty with GPs. Implemented in R, the clustered multi-node learning with Gaussian process (CMNL-GP) method improved the RMSE by over 30% in low-data scenarios compared with the ARIMA and NN baseline models.
Taken together, these studies illustrate how modular, open-source tools can be combined to explore novel forecasting frontiers—whether addressing geospatial heterogeneity, data sparsity, or the integration of domain-external variables (e.g., clustered battery data or EV charging sessions). Accordingly, this domain serves as a proving ground to extend R’s capabilities beyond traditional energy forecasting paradigms, encouraging innovation in modeling practices and systems-level understanding. Each application area (e.g., UPS batteries, EV charging, or micro-grids) presents unique modeling challenges, ranging from high temporal variability and spatial heterogeneity to long-term policy sensitivity and data limitations, and these studies demonstrate how R’s extensive package ecosystem, transparency, and flexibility support a wide spectrum of analytical requirements.
Yet, research in this area contends with heterogeneous, irregular data—such as clustered or incomplete charging sessions—and nonstandard metrics, all of which complicate benchmarking [
124]. Clustering-assisted and transfer-based approaches showed more stable performance across varying regimes [
125]. Given the novelty of these tasks, transparent preprocessing and reproducible workflows are as important as raw accuracy for ensuring reusability [
126]. Ultimately, R emerges not only as a statistical programming language but as a comprehensive platform that can be utilized to develop interpretable, adaptable, and context-sensitive energy forecasting models.