1. Introduction
Accurate energy demand forecasting has become a crucial component in optimizing energy-intensive industrial processes [
1]. In the context of manufacturing sectors such as quicklime production, effective forecasting not only supports operational efficiency but also contributes to strategic planning and sustainability initiatives [
2]. Quicklime manufacturing involves multiple stages, such as crushing, screening, calcination, and grinding, each associated with distinct energy consumption patterns influenced by process dynamics and scheduling [
3]. As industries strive to reduce costs and carbon emissions, the ability to anticipate and manage energy demand with precision becomes a competitive advantage.
Traditionally, forecasting approaches such as ARIMA and Exponential Smoothing have served as standard tools for time series prediction [
4]. While these classical models perform well under linear and stationary conditions, they often struggle to capture the nonlinearities, complex interdependencies, and external influences prevalent in industrial environments [
5]. The proliferation of sensor-based data in modern settings has made richer datasets available, but it also introduces modeling challenges that surpass the capabilities of traditional statistical techniques [
6]. Comprehensive reviews have highlighted a broad spectrum of methodologies—including multiple regression, adaptive load forecasting, stochastic time series, ARMAX models with genetic algorithms, fuzzy logic, neural networks, and expert systems—as preferred alternatives for addressing these complexities [
7]. This diversity reflects the multifaceted nature of energy forecasting in industrial contexts, where tailored solutions are often necessary.
Recent advances in machine learning, particularly the development of neural networks such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and gated recurrent units (GRUs), have demonstrated significant potential in predicting the power profiles of manufacturing processes before production begins. Models oriented towards batch production, for example, have achieved notable accuracy, with absolute prediction errors as low as 5% [
8]. These approaches showcase the advantages of leveraging deep learning to capture temporal dependencies and complex relationships in industrial energy consumption data.
In terms of short-term load forecasting (STLF), recent work has shifted toward hybrid deep learning architectures, uncertainty quantification through probabilistic modeling, and the integration of diverse IoT and environmental data.
Some studies have concentrated on the design of hybrid deep learning architectures that combine different neural network layers (such as CNNs for spatial/local features and LSTMs for temporal dependencies) to improve accuracy. Examples include combining CNN and LSTM with techniques such as Empirical Mode Decomposition (EMD) to simplify complex energy demand profiles [
9], or K-means for improving classification tasks [
10]. Attention mechanisms have been combined with LSTM [
11] and with a convolutional bidirectional long short-term memory autoencoder (CBLSTM-AE) model that uses Beluga Whale Optimization (BWO) for hyperparameter tuning [
12]. In terms of probabilistic and interval forecasting, some works have tackled uncertainty by predicting ranges or probability densities. Some examples include Quantile Regression (QR) [
10,
13,
14,
15,
16] and Bayesian Approaches [
17].
Optimization and Automated Machine Learning (AutoML) have also been studied by emphasizing how the model is tuned, using meta-heuristic algorithms to find the best hyperparameters, such as genetic algorithms [
13], Beluga Whale Optimization [
12], Bayesian Optimization [
9] and AutoML for feature selection and model training, which further improved accuracy and generalization [
18].
In terms of domain-specific applications, several works have addressed hospitals [
19], industrial plants [
20], manufacturing [
21], and residential settings, microgrids, and smart grids [
9,
10,
15,
17,
22], as well as day-ahead interval forecasting for photovoltaic (PV) power [
14].
Based on these developments, deep learning has emerged in recent years as a powerful alternative to classical methods for time series forecasting, offering superior performance in capturing complex temporal dynamics [
23]. Architectures like LSTM and GRU excel in learning long-range dependencies and nonlinear relationships [
24]. Comparative studies across diverse domains—including energy systems—consistently demonstrate that deep learning models outperform statistical methods, especially for multivariate and multi-horizon forecasting tasks [
25].
Among these advanced architectures, the Temporal Fusion Transformer (TFT) has attracted increasing attention due to its hybrid structure, which combines the strengths of recurrent layers for local processing with interpretable attention mechanisms for capturing long-term dependencies [
26]. Originally introduced by Lim et al. in 2021 [
27], TFT was designed to handle heterogeneous inputs, including static metadata, known future events, and time-varying covariates. Its use of gating layers, variable selection networks, and quantile regression makes it particularly suitable for forecasting tasks that require both accuracy and interpretability [
28].
The benefits of TFT have been validated in various sectors. For instance, Gokhale et al. in 2023 [
29] applied a transfer learning framework with TFT to forecast electricity consumption in home energy management systems, achieving a 15% reduction in mean absolute error and a 2% reduction in energy costs through improved predictive control. Similarly, Maragkos and Refanidis in 2025 [
30] conducted a comprehensive evaluation of state-of-the-art forecasting models, including TFT, on a dataset combining energy consumption and weather variables from 24 European countries, highlighting the robust performance of transformer-based models in complex multivariate scenarios.
Despite the demonstrated potential of these models, their application in the industrial manufacturing sector, particularly in quicklime production, remains underexplored. This study addresses this gap by implementing a Temporal Fusion Transformer to forecast energy demand in an operating quicklime plant using real industrial measurements collected during 2024. The dataset consists of hourly electrical consumption records obtained at the plant level, enriched with contextual information such as working schedules, production stages, and exogenous variables including coke consumption during the calcination process. By integrating multivariate time series with temporal and operational features derived from actual plant operation, the proposed approach captures realistic consumption patterns and variability inherent to industrial production environments.
Contributions
The main contributions of this study are summarized as follows:
Industrial probabilistic forecasting under discontinuous load dynamics. This work investigates short-term load forecasting in a stage-driven manufacturing environment characterized by fixed schedules, abrupt transitions, and structured inactivity periods. The study provides empirical evidence on how a transformer-based architecture performs under these non-continuous industrial conditions, which are rarely addressed in the prior forecasting literature.
Integration of operational covariates in a multi-horizon TFT framework. A forecasting framework based on the Temporal Fusion Transformer (TFT) is implemented by combining historical load, stage encoding, coke consumption, and known-future temporal variables. The model is configured for multi-horizon probabilistic prediction, enabling the analysis of heterogeneous inputs in a real industrial setting.
Evaluation beyond point-wise accuracy using cumulative energy fidelity. In addition to MAE and RMSE, the study introduces a cumulative energy deviation metric derived from hourly power integration, allowing the assessment of forecasting performance from an operational planning perspective, for which total energy preservation is critical.
Probabilistic forecasting and uncertainty calibration in industrial operation. The TFT model is evaluated using quantile regression to produce prediction intervals, and uncertainty quality is analyzed through coverage metrics, providing insight into risk-aware decision-making for energy management.
Interpretability-oriented assessment of industrial forecasting variables. The study examines the role of operational covariates within the TFT architecture, highlighting the relevance of stage-dependent dynamics and contextual variables in explaining industrial load behavior.
Transparent comparative evaluation against deep learning baselines. A consistent experimental protocol is implemented to compare TFT with LSTM, GRU, and N-BEATS models using identical data splits, covariates, and evaluation horizons, ensuring fairness and reproducibility.
The remainder of this paper is organized as follows.
Section 1 introduces the research context and motivates the need for accurate short-term electric load forecasting in stage-driven industrial processes.
Section 2 describes the materials and methods, including the data acquisition process from an operating quicklime plant, the construction of the hourly multivariate dataset, and the implementation of the Temporal Fusion Transformer (TFT) for multi-horizon and probabilistic load forecasting.
Section 3 presents and discusses the forecasting results, covering deterministic performance, cumulative energy deviation, probabilistic prediction intervals, and residual analysis. Finally,
Section 4 summarizes the main conclusions of the study and outlines potential directions for future research.
2. Materials and Methods
2.1. Dataset Description
An hourly dataset was collected from an operating quicklime production plant over a full year (2024) to characterize the electrical energy consumption of the manufacturing process under real industrial conditions. The recorded data reflect the actual operational behavior of the plant, including fixed working schedules, stage-based production logic, and abrupt load transitions associated with the activation and deactivation of process equipment.
Electrical energy consumption was measured at the main electrical distribution level, capturing the aggregated power demand of the production facility. The recorded load implicitly reflects the operation of multiple electric motors associated with different stages of the process, such as material reception and crushing, screening, post-screening handling, and kiln operation with grinding. Although individual motor measurements were not used as direct inputs to the forecasting model, the aggregated signal preserves the characteristic signatures of stage-dependent motor activity. The schematic representation of the hourly energy data acquisition and processing workflow for the quicklime production plant is shown in
Figure 1.
The production workflow is organized into four main stages: reception and crushing, screening, post-screening, and kiln with grinding. These stages are executed sequentially following the plant’s standard operating procedure. Production activities take place exclusively on working days (Monday to Friday) within a fixed daily schedule from 08:00 to 18:00, while nights, weekends, and non-operational periods are characterized by zero or near-zero electrical demand.
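The fixed schedule described above can be expressed as a simple mask over hourly timestamps. The following is a minimal sketch, assuming that the 08:00 to 18:00 window corresponds to the hourly slots starting at 08:00 through 17:00 and that holidays are ignored; the function name `is_active_hour` is hypothetical and not part of the study's code.

```python
from datetime import datetime, timedelta


def is_active_hour(ts: datetime) -> bool:
    """Return True if the hourly slot starting at `ts` lies in the production schedule.

    Assumption: slots starting at 08:00..17:00 on Monday-Friday are active.
    """
    is_weekday = ts.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and 8 <= ts.hour < 18


# One full week of hourly timestamps starting on Monday, 2 December 2024.
week = [datetime(2024, 12, 2) + timedelta(hours=h) for h in range(168)]
active_hours = sum(is_active_hour(ts) for ts in week)  # 5 days x 10 hours
```

Such a mask is also a convenient way to restrict an evaluation to active production hours.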
During active production hours, the measured electrical load exhibits natural variability caused by partial-load motor operation, process control actions, and transient events inherent to industrial operation. These fluctuations are directly reflected in the recorded dataset, allowing the forecasting model to learn realistic consumption patterns without the need for artificial load synthesis or stochastic simulation assumptions.
In addition to electrical measurements, an auxiliary operational variable representing hourly coke consumption during the calcination stage was included as an exogenous input. Coke constitutes the primary non-electrical energy input of the kiln, and its recorded usage provides valuable contextual information related to production intensity and process state. The inclusion of this variable enables a more comprehensive representation of the energy dynamics of the quicklime production process and supports subsequent analyses related to energy efficiency and operational planning.
The dataset consists of hourly records, each including the date, time, day of the week, active stage, coke consumption, and total electrical consumption.
Figure 2 shows one day from the dataset.
Figure 3 shows the behavior of the variable “total consumption (kW)” over one week of the dataset. Production load appears on the first five days (Monday to Friday), while weekends correspond to non-operational periods with a near-zero recorded load.
2.2. Data Preprocessing and Treatment of Non-Operational Periods
All measurements were first aligned to a regular hourly time index using resampling. A distinction was made between missing records and true low-demand periods.
Non-operational periods, including nights and weekends, correspond to measured near-zero power demand, rather than missing data. These values were retained as valid observations to preserve the operational structure of the plant and avoid introducing artificial bias.
When gaps in the time series were detected within operational hours, they were treated as missing values and handled by either interpolation or removal, depending on the length of the gap.
This preprocessing strategy ensures that zero values represent actual plant inactivity, rather than data absence, which is essential for preserving the statistical properties of the load profile and for training the forecasting models. Algorithm 1 shows the steps for data preparation and feature construction within the developed industrial load forecasting methodology.
| Algorithm 1 Data preparation and feature construction for industrial load forecasting. |
Input: Raw dataset with timestamp t, power demand (kW), production stage, coke consumption
Output: Target series Y, past covariates, known-future covariates, training and validation sets
1: Resample to regular hourly time index
2: Identify missing timestamps and handle according to operational log
3: For non-operational hours (nights/weekends), keep measured near-zero values
4: Encode production stage using one-hot representation
5: Construct past covariates
6: Construct known-future covariates
7: Normalize continuous variables using scaler fitted on training data
8: Define input window length L = 168 h
9: Define forecast horizon H = 24 h
10: Split dataset into training and validation sets chronologically
11: Return prepared time series objects for model training
|
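The data preparation steps can be sketched with pandas as follows. This is an illustrative sketch rather than the study's code: the column names `power_kw`, `coke_kg`, and `stage`, the 3 h interpolation limit, the min-max scaler, and the synthetic data are all assumptions made for the example.

```python
import pandas as pd

NUM_COLS = ["power_kw", "coke_kg"]  # hypothetical column names


def prepare_features(raw: pd.DataFrame, max_gap_hours: int = 3) -> pd.DataFrame:
    """Hourly resampling, gap handling, and covariate construction."""
    # Regular hourly index; missing timestamps show up as NaN rows.
    df = raw.sort_index().resample("h").first()
    # Fill at most `max_gap_hours` consecutive missing values per gap (assumed limit);
    # rows that are still incomplete are dropped afterwards.
    df[NUM_COLS] = df[NUM_COLS].interpolate(limit=max_gap_hours, limit_area="inside")
    df["stage"] = df["stage"].ffill(limit=max_gap_hours)
    df = df.dropna()
    # Measured near-zero values (nights/weekends) are left untouched.
    # One-hot encoding of the production stage.
    onehot = pd.get_dummies(df["stage"], prefix="stage", dtype=float)
    out = pd.concat([df[NUM_COLS], onehot], axis=1)
    # Calendar features are known in advance for any future hour.
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek
    return out


def split_and_scale(features: pd.DataFrame, split_ts: str):
    """Chronological split, then min-max scaling fitted on the training part only."""
    cut = pd.Timestamp(split_ts)
    train = features[features.index < cut].copy()
    val = features[features.index >= cut].copy()
    lo = train[NUM_COLS].min()
    span = (train[NUM_COLS].max() - lo).replace(0.0, 1.0)  # avoid division by zero
    train[NUM_COLS] = (train[NUM_COLS] - lo) / span
    val[NUM_COLS] = (val[NUM_COLS] - lo) / span
    return train, val


# Synthetic stand-in data: Thursday 28 November to Sunday 1 December 2024.
idx = pd.date_range("2024-11-28", periods=96, freq="h")
raw = pd.DataFrame({"power_kw": 0.0, "coke_kg": 0.0, "stage": "idle"}, index=idx)
work = (idx.dayofweek < 5) & (idx.hour >= 8) & (idx.hour < 18)
raw.loc[work, "power_kw"] = 120.0
raw.loc[work, "coke_kg"] = 35.0
raw.loc[work, "stage"] = "kiln"
raw = raw.drop(idx[10])  # simulate one missing record at 10:00 on the first day

features = prepare_features(raw)
train, val = split_and_scale(features, "2024-12-01")
```

Fitting the scaler on the training part only avoids leaking validation statistics into training, which is the reason for the ordering of the normalization and split steps in this sketch.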
2.3. Temporal Fusion Transformer (TFT) Model
The Temporal Fusion Transformer (TFT) is a deep learning architecture designed for multivariate and multi-horizon time series forecasting; that is, it predicts multiple future time steps while handling several variables simultaneously. Unlike traditional models, TFT provides not only point forecasts but also prediction intervals, which help quantify uncertainty.
2.4. Main Components of the TFT
First, the model incorporates gating mechanisms that dynamically control the flow of information within the network. By means of gated residual blocks, the TFT can skip unnecessary transformations under certain conditions, thus adapting its depth and complexity to the characteristics of the dataset. This flexibility allows the model to behave almost linearly in simple or noisy scenarios while exploiting its nonlinear representational capacity when more complex dependencies are present.
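To make the gating idea concrete, the following NumPy sketch implements a gated residual block in the form GRN(a) = LayerNorm(a + GLU(η1)), with η1 = W1·η2 + b1 and η2 = ELU(W2·a + b2), following the formulation in the original TFT paper. It is a simplified illustration under stated assumptions: no context vector, no dropout, no learnable normalization parameters, and random weights in place of trained ones. It does not reproduce the Darts implementation.

```python
import numpy as np


def elu(x):
    # ELU activation: x for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    # Normalization over the feature axis, without learnable scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gated_residual_network(a, p):
    """GRN(a) = LayerNorm(a + GLU(eta1)) for an input vector `a` of size d."""
    eta2 = elu(a @ p["W2"] + p["b2"])          # nonlinear transformation
    eta1 = eta2 @ p["W1"] + p["b1"]            # linear projection
    gate = sigmoid(eta1 @ p["W4"] + p["b4"])   # gate values in (0, 1)
    glu = gate * (eta1 @ p["W5"] + p["b5"])    # gated linear unit
    return layer_norm(a + glu)                 # residual connection + normalization


rng = np.random.default_rng(0)
d = 8  # illustrative feature size
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("W1", "W2", "W4", "W5")}
params.update({name: np.zeros(d) for name in ("b1", "b2", "b4", "b5")})
a = rng.normal(size=d)

out_open = gated_residual_network(a, params)

# A strongly negative gate bias closes the gate, so the block reduces to
# LayerNorm(a): the nonlinear branch is effectively skipped.
closed = dict(params, b4=np.full(d, -50.0))
out_closed = gated_residual_network(a, closed)
```

With the gate closed, only the residual path contributes, which is the mechanism that lets the network fall back to near-linear behavior when additional nonlinear processing is not needed.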
A second innovation lies in the use of variable selection networks, which identify the most relevant inputs for prediction at each time step. This mechanism applies both to static covariates and to time-dependent variables, distinguishing between those observed in the past and those known a priori for the future. In this way, the model improves efficiency by focusing on the most informative signals, while at the same time enhancing interpretability by explicitly revealing which variables drive the forecasting process.
The third essential component is the set of static covariate encoders, designed to integrate information that does not change over time, such as the geographic location or the intrinsic nature of the entity under study. These encoders generate context vectors that condition various parts of the network, ensuring that static features appropriately influence temporal dynamics and contribute to the construction of richer internal representations.
Fourth, TFT relies on temporal processing mechanisms to capture both local and long-term dependencies. A sequence-to-sequence architecture based on recurrent networks is employed to extract short-range temporal patterns, while an interpretable multi-head self-attention block captures broader relationships across time. The integration of these two components enables the model to simultaneously identify short-term fluctuations and long-term seasonal or structural effects, thereby addressing the inherent complexity of multivariate time series.
Finally, TFT produces prediction intervals through quantile forecasting, which extends its practical applicability by providing not only point estimates but also probabilistic ranges for future values. This probabilistic perspective is particularly valuable in real-world scenarios, as it allows practitioners to quantify uncertainty and make more robust decisions in environments characterized by high variability and risk. The architecture of the Temporal Fusion Transformer (TFT) is shown in
Figure 4.
2.4.1. Quantile Regression
TFT does not produce a single point estimate but generates multiple quantiles (e.g., 10%, 50%, 90%) to estimate a likely range of future values. The model is trained with the pinball loss function, which penalizes over- and under-prediction asymmetrically according to the quantile level.
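As an illustration of the pinball loss, the sketch below evaluates it for a single under-predicted value at two quantile levels. The function name and the numerical values are illustrative assumptions, not taken from the study.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction (diff > 0) is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Illustrative case: the forecast (90 kW) is below the observed value (100 kW).
loss_q90 = pinball_loss([100.0], [90.0], q=0.9)  # heavy penalty for the upper quantile
loss_q10 = pinball_loss([100.0], [90.0], q=0.1)  # light penalty for the lower quantile
```

The same under-prediction costs far more at the 90% level than at the 10% level, which pushes the upper quantile forecast above most observations and the lower one below them.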
2.4.2. Interpretability
TFT is designed with interpretability in mind, achieved through:
A variable selection mechanism that assigns global weights.
An attention mechanism tailored to trace key inputs. This allows the model to not only predict but also explain what information it used and why.
2.4.3. Implementation
In this work, the TFT model was implemented using the Darts library (version 0.30.0) in Python (version 3.10).
2.5. Energy Deviation Metric
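Although the target variable corresponds to the hourly average power demand, $y_t$, expressed in kW, cumulative energy over a forecasting horizon of $T$ hourly steps is obtained by temporal integration:
$$E = \sum_{t=1}^{T} y_t \, \Delta t,$$
where $\Delta t = 1$ h, yielding energy in kWh.
The predicted cumulative energy is defined as:
$$\hat{E} = \sum_{t=1}^{T} \hat{y}_t \, \Delta t.$$
The energy deviation metric reported in this study is computed as:
$$\text{Energy deviation}\ (\%) = \frac{\left| \hat{E} - E \right|}{E} \times 100.$$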
This metric evaluates the ability of the forecasting model to preserve total energy over the operational horizon.
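A minimal sketch of this computation is given below. It assumes that the deviation is reported as an absolute relative difference in percent; the function names and the six-hour example profile are illustrative and are not measured plant data.

```python
def cumulative_energy_kwh(power_kw, dt_hours=1.0):
    """Integrate hourly average power (kW) into energy (kWh)."""
    return sum(p * dt_hours for p in power_kw)


def energy_deviation_pct(actual_kw, predicted_kw, dt_hours=1.0):
    """Absolute relative deviation of predicted vs. actual cumulative energy, in %."""
    e_true = cumulative_energy_kwh(actual_kw, dt_hours)
    e_pred = cumulative_energy_kwh(predicted_kw, dt_hours)
    if e_true == 0:
        raise ValueError("Actual cumulative energy is zero; deviation is undefined.")
    return abs(e_pred - e_true) / e_true * 100.0


# Illustrative hourly profiles (kW), not measured data.
actual = [0.0, 0.0, 100.0, 120.0, 80.0, 0.0]
predicted = [0.0, 5.0, 95.0, 110.0, 85.0, 0.0]
deviation = energy_deviation_pct(actual, predicted)  # 300 kWh vs 295 kWh
```

Hourly errors of opposite sign partly cancel in the sum, so this metric can be small even when point-wise errors are not, which is why it complements MAE and RMSE rather than replacing them.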
2.6. TFT Training and Forecasting Pipeline
The implementation of the forecasting framework begins with the initialization of the computational environment and the required modeling tools. Core libraries for data handling and visualization (pandas, numpy, and matplotlib) are employed, together with the main components of the Darts framework, including TimeSeries for structured time series representation, TFTModel for model construction, and QuantileRegression for probabilistic learning.
Following this initialization stage, the forecasting procedure is organized into six sequential steps covering data preparation, covariate construction, dataset partitioning, model configuration and training, multi-horizon inference, and result export. These steps define the operational workflow adopted in this study and are illustrated in
Figure 5. The detailed description of each step is provided below.
In Step 1, the dataset is prepared. The date column is converted to datetime format and set as the DataFrame index. The target series is built from the total consumption (kW) column. Temporal covariates are then created from columns such as hour, day of the week, coke usage (coke-kg), and production stage, which is encoded using a one-hot categorical representation to preserve stage independence and avoid ordinal bias. These covariates are transformed into a multivariate TimeSeries structure to be used as additional inputs for the model.
In Step 2, the dataset is split into training and validation sets. In this case, data prior to 1 December 2024 is used to train the model, while data after that date is reserved for validation and prediction. This split is applied to both the target series and the covariates.
In Step 3, the Temporal Fusion Transformer (TFT) model is defined. This model combines LSTM networks with Transformer-style attention mechanisms and is specially designed to capture temporal and seasonal patterns in multivariate time series. Its hyperparameters are configured, including input length (168 h, equivalent to one week), forecast horizon (24 h), number of hidden layers, batch size, and number of training epochs. Quantile Regression is also specified as the probabilistic estimation technique, and the model is trained using fit().
In Step 4, the trained model is used to forecast energy consumption for the next 24 h, using the most recent covariates (cov-val).
In Step 5, a plot is generated, showing the actual training data from the last week alongside the 24 h prediction and allowing a visualization of the model’s ability to follow the consumption pattern.
Finally, in Step 6, the prediction is exported to a CSV file with columns for the date and the predicted consumption-(kW). This is useful for further analysis, reporting, or integration into energy monitoring systems. Algorithm 2 presents the methodological steps for multi-horizon TFT forecasting and evaluation.
| Algorithm 2 Multi-horizon TFT forecasting and evaluation. |
Input: Target series Y, past and known-future covariates
Output: Point forecasts, quantile forecasts, evaluation metrics
1: Initialize TFT model with hidden size d, attention heads h
2: Set quantile regression loss (pinball loss) for the selected quantile levels
3: Train model using training set with input window L and horizon H
4: Apply early stopping based on validation loss
5: For each day in evaluation week do
6: Observe last L hours of Y
7: Predict next H hours
8: Store predicted median and quantiles
9: End for
10: Compute MAE and RMSE over evaluation horizon
11: Compute cumulative energy:
12: $E = \sum_{t} y_t \, \Delta t$ and $\hat{E} = \sum_{t} \hat{y}_t \, \Delta t$, summed over the evaluation horizon with $\Delta t = 1$ h
13: Compute Prediction Interval Coverage Probability (PICP)
14: Return forecasts and performance metrics
|
2.7. Model Configuration and Training Procedure
The Temporal Fusion Transformer (TFT) model was implemented using the Darts framework (Python, PyTorch backend). The forecasting setup employed an input window length of 168 h (one week of historical context) and an output horizon of 24 h (direct multi-step prediction).
The main hyperparameters were configured as follows: hidden size = 32, number of LSTM layers = 1, number of attention heads = 4, dropout rate = 0.1, batch size = 64, learning rate = , and number of training epochs = 50. Relative positional encoding was enabled through the add_relative_index option.
The model was trained using quantile regression with quantiles to obtain probabilistic forecasts. Optimization was performed using the Adam optimizer with a fixed learning rate and a deterministic random seed (42) to ensure reproducibility.
Chronological splitting was applied for training and validation. For each forecast date, the training set included all observations up to the forecast start minus the input window, while the validation horizon consisted of the subsequent 24 h. Rolling weekly backtesting was performed to evaluate generalization performance across multiple operational cycles. The main configuration parameters of the TFT model are summarized in
Table 1.
The architectural framework of the Temporal Fusion Transformer (TFT) utilized in this study is illustrated in
Figure 6. The model employs a multi-horizon forecasting approach that effectively integrates static covariates, known future inputs, and observed historical data. Key components include Gated Residual Networks (GRN), which enable the suppression of irrelevant inputs through sophisticated variable selection blocks, and a Temporal Self-Attention mechanism. This attention layer is critical for identifying long-term dependencies and seasonal patterns within the quicklime production energy cycles. By utilizing recurrent layers for local processing and self-attention for global dependencies, the TFT architecture ensures a robust multi-quantile probabilistic output, allowing for a comprehensive quantification of uncertainty across the forecasting horizon.
2.8. Forecasting Strategy and Rolling Evaluation
The TFT model is configured to produce direct multi-horizon predictions with an output length of 24 h per inference step. Rather than recursively feeding predictions into subsequent time steps, the model generates each 24 h forecast in a single forward pass using observed historical data.
Weekly prediction curves are constructed using a rolling forecasting scheme. For each day in the evaluation period, the model receives a fixed input window of 168 h (one week of past observations) and produces a direct 24 h forecast. The input window is then shifted forward by 24 h, and the procedure is repeated sequentially across the week.
This strategy results in a series of overlapping historical windows but non-overlapping forecast horizons, ensuring independence between daily predictions while preserving temporal continuity in the evaluation.
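The rolling scheme can be sketched in terms of index ranges over an hourly series. The function below is a hypothetical helper written for illustration; it only generates the input and forecast positions and does not call any forecasting model.

```python
def rolling_windows(n_hours, start, input_len=168, horizon=24, n_days=7):
    """Return (input_range, forecast_range) pairs for a daily rolling evaluation.

    `start` is the position of the first forecast hour in the series.
    """
    windows = []
    for day in range(n_days):
        t0 = start + day * horizon  # first hour of this day's forecast
        if t0 - input_len < 0 or t0 + horizon > n_hours:
            break  # not enough history or not enough future data
        windows.append((range(t0 - input_len, t0), range(t0, t0 + horizon)))
    return windows


# Two weeks of hourly data: the first week is history, the second is evaluated.
windows = rolling_windows(n_hours=336, start=168)
first_inputs, first_forecast = windows[0]
second_inputs, second_forecast = windows[1]
shared_history = len(set(first_inputs) & set(second_inputs))  # overlapping inputs
```

In this example, consecutive input windows share 144 of their 168 hours, while the seven 24 h forecast ranges tile the evaluation week without overlap.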
Teacher forcing is applied during model training as part of the sequence-to-sequence learning process. However, during inference and evaluation, forecasts are generated without teacher forcing, relying exclusively on observed past inputs and known covariates.
2.9. Baseline Models and Comparative Experimental Setup
To assess the performance of the proposed TFT framework, a comparative study was conducted using three deep learning baselines: LSTM, GRU, and N-BEATS.
All models were trained and evaluated under a consistent experimental protocol. Each model received the same historical input window of 168 h and generated forecasts over a one-week horizon (168 h). Chronological splitting was applied, where training data included all observations prior to the forecast start date, and evaluation was performed on the subsequent period.
Scaling and preprocessing procedures were identical across models to ensure comparability. Training was performed for 30 epochs using the Adam optimizer, with a batch size of 64 and a learning rate of
. A fixed random seed (42) was used to guarantee reproducibility.
Table 2 shows the hyperparameters and input configuration for baseline models.
Model-specific configurations were defined as follows:
TFT was implemented as a probabilistic multi-horizon forecasting model using quantile regression and past covariates.
LSTM and GRU were implemented as autoregressive sequence models using known future covariates.
N-BEATS was implemented as a univariate deep forecasting model without exogenous covariates.
2.10. Probabilistic Forecasting and Uncertainty Estimation
Prediction intervals were derived from the probabilistic predictive distribution learned through quantile regression. In practice, the Darts probabilistic prediction interface was used with samples drawn from the fitted likelihood to estimate empirical prediction intervals (P10–P90) and median forecasts. This sampling procedure serves only to approximate the predictive distribution implied by the quantile-regression model.
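Let $\{\hat{y}_{h}^{(s)}\}_{s=1}^{S}$ denote $S$ samples drawn from the predictive distribution at horizon $h$. The P10–P90 interval is computed as:
$$\left[\, Q_{0.10}\!\left(\{\hat{y}_{h}^{(s)}\}_{s=1}^{S}\right),\; Q_{0.90}\!\left(\{\hat{y}_{h}^{(s)}\}_{s=1}^{S}\right) \right],$$
and the median forecast corresponds to $Q_{0.50}\!\left(\{\hat{y}_{h}^{(s)}\}_{s=1}^{S}\right)$, where $Q_{p}$ denotes the empirical $p$-quantile of the samples.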
This formulation provides reproducible uncertainty quantification and allows consistent comparison across forecasting models.
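The empirical quantiles can be obtained from a sample array with NumPy, as sketched below. The samples here are synthetic Gaussian draws used only as a stand-in for model output, and the choice of 200 samples per horizon is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, horizon = 200, 24  # assumed sample count; 24 h forecast horizon
# Synthetic stand-in for samples drawn from a fitted predictive distribution,
# arranged as (sample, horizon step).
samples = rng.normal(loc=100.0, scale=10.0, size=(n_samples, horizon))

# Empirical P10, P50 (median), and P90 for each horizon step.
p10, p50, p90 = np.percentile(samples, [10, 50, 90], axis=0)
```

Each of `p10`, `p50`, and `p90` holds one value per forecast hour, so the P10–P90 band can be plotted directly around the median profile.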
To further evaluate the calibration and sharpness of the probabilistic forecasts, two additional metrics were computed for the P10–P90 prediction intervals.
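Prediction Interval Coverage Probability (PICP) measures the empirical proportion of observations that fall within the prediction interval:
$$\mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\, l_i \le y_i \le u_i \,\},$$
where $y_i$ is the observed value, $l_i$ and $u_i$ are the lower (P10) and upper (P90) bounds of the interval, and $N$ is the number of evaluated time steps.
Mean Prediction Interval Width (MPIW) quantifies the average width of the prediction intervals:
$$\mathrm{MPIW} = \frac{1}{N} \sum_{i=1}^{N} \left( u_i - l_i \right).$$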
These metrics allow evaluating both the reliability (coverage) and sharpness (interval width) of the probabilistic forecasts.
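Both metrics are straightforward to compute from the interval bounds, as the following sketch shows. The function names and the five example values are illustrative and do not come from the study's results.

```python
def picp(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(inside) / len(inside)


def mpiw(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Illustrative values in kW, not measured data.
observed = [100.0, 120.0, 80.0, 0.0, 50.0]
p10 = [90.0, 100.0, 85.0, 0.0, 40.0]
p90 = [110.0, 115.0, 95.0, 5.0, 60.0]

coverage = picp(observed, p10, p90)  # 3 of 5 observations are covered
width = mpiw(p10, p90)               # mean of the widths 20, 15, 10, 5, 20
```

For a P10–P90 interval the nominal coverage is 80%, so a PICP close to 0.8 together with a small MPIW indicates intervals that are both well calibrated and sharp.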
3. Results
3.1. Weekly Energy Consumption Forecasting
Figure 7 illustrates the one-week energy consumption forecasting performance of the Temporal Fusion Transformer (TFT) model by comparing the predicted median profile with the corresponding measured data. The predicted trajectory closely follows the actual consumption pattern, accurately capturing the daily activation and shutdown of production stages, as well as the magnitude and timing of peak demand periods. Minor deviations are observed during rapid load transitions, which are primarily associated with abrupt changes in motor operation and process dynamics. Overall, the strong agreement between the predicted and observed profiles demonstrates the model’s ability to learn the temporal structure and operational constraints of the quicklime production process, confirming its suitability for short-term industrial energy forecasting applications.
3.2. Forecast Error Metrics
The forecasting performance of the proposed Temporal Fusion Transformer (TFT) model was evaluated not only using point-wise accuracy metrics but also in terms of cumulative energy deviation, as summarized in
Table 3. When assessing the prediction over a full one-week horizon, the model achieved a global energy error of 4.78%, indicating a very accurate preservation of the overall energy balance. This result confirms the model’s ability to correctly identify non-operational periods, such as nights and weekends, while maintaining consistency in the aggregated energy estimation.
When the evaluation was restricted to active production hours only, corresponding to working days between 08:00 and 18:00, the energy error increased slightly to 5.25%. This behavior is expected due to the higher variability of the industrial process during operational stages, including partial-load motor operation and abrupt transitions between production phases. Nevertheless, the obtained value remains well below typical error margins reported in industrial energy forecasting studies, demonstrating that the proposed model reliably captures the cumulative energy demand during active operation.
The relatively small difference between global and active-hour energy errors highlights the robustness of the TFT model in balancing structural accuracy and operational variability. While short-term peak values may be mildly smoothed, the model effectively preserves the total energy consumption over the evaluation period. This characteristic is particularly advantageous for planning-oriented applications, such as energy procurement, tariff optimization, and production scheduling, for which cumulative energy accuracy is more critical than instantaneous peak matching.
Overall, the obtained results demonstrate that increasing the number of training epochs significantly enhances the model’s capability to learn complex load patterns and stage-dependent consumption dynamics. With both global and active energy errors remaining below 6%, the proposed TFT framework provides a reliable and practically relevant solution for short-term energy forecasting in quicklime production processes.
It is important to note that the reported energy error metrics correspond to two different evaluation weeks. The detailed TFT performance presented in Table 3 was computed for the week starting on 1 December 2024, which was selected as the primary evaluation period for analyzing cumulative energy accuracy and uncertainty behavior. In contrast, the comparative analysis across forecasting models reported in Table 4 was conducted using a separate evaluation window starting on 15 December 2024, in order to assess model robustness and generalization under different operational conditions.
3.3. Comparison of Deep Learning Models for Industrial Energy Forecasting
Table 4 summarizes the comparative performance of the proposed TFT model against three baselines (LSTM, GRU, and N-BEATS) over a one-week horizon after training all models for 30 epochs. All models were evaluated on the same evaluation week (starting on 15 December 2024) to ensure a consistent comparison of forecasting performance. In terms of point-wise accuracy, the recurrent models achieved the best results, with LSTM and GRU yielding the lowest MAE (1.31–1.34 kW) and RMSE (2.84–2.99 kW), confirming the effectiveness of gated and memory-based recurrent dynamics for tracking rapid load variations and peak magnitudes during operation.
When shifting the evaluation criterion to cumulative energy deviation, all models exhibited low global energy errors (below 2.2%), indicating an overall strong preservation of the weekly energy balance. The N-BEATS model achieved the lowest global energy error (1.45%), while TFT remained highly competitive (1.90%) despite its higher point-wise errors. This result suggests that the transformer-based architecture can preserve aggregated demand even when short-term peak values are mildly smoothed.
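The way a forecast can show higher point-wise errors while still preserving the aggregated demand can be sketched as follows. The example uses the standard definitions of MAE and RMSE and assumes the energy error is the relative deviation between total predicted and total measured energy. The two toy forecasts are hypothetical and do not correspond to any model in Table 4.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def energy_error_pct(actual, predicted):
    """Relative deviation (%) between total predicted and total measured energy."""
    return abs(sum(predicted) - sum(actual)) / sum(actual) * 100.0


# Illustrative hourly load in kW, alternating between low and peak demand.
actual = [10.0, 50.0, 10.0, 50.0, 10.0, 50.0]
# Forecast A flattens the peaks but keeps the total energy unchanged.
smoothed = [20.0, 40.0, 20.0, 40.0, 20.0, 40.0]
# Forecast B follows the peak shape but is scaled up by 10%.
scaled = [11.0, 55.0, 11.0, 55.0, 11.0, 55.0]

for name, forecast in (("smoothed", smoothed), ("scaled", scaled)):
    print(name,
          f"MAE={mae(actual, forecast):.2f} kW",
          f"RMSE={rmse(actual, forecast):.2f} kW",
          f"energy error={energy_error_pct(actual, forecast):.2f}%")
```

The smoothed forecast has the larger MAE and RMSE but a zero energy error, whereas the scaled forecast tracks the peaks more closely yet accumulates a 10% energy deviation, so the two criteria can rank forecasts differently.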
A more stringent assessment restricted to active production hours (08:00–18:00, weekdays) reveals increased sensitivity to operational variability. In this regime, LSTM and GRU retained the best cumulative accuracy (2.05% and 2.74%, respectively), whereas TFT showed a higher active-hour energy error (3.88%). This behavior is consistent with the model’s tendency to regularize extreme values and to rely on covariate-driven structure, rather than purely local amplitude matching.
Overall, the results highlight a clear trade-off: recurrent baselines provide superior instantaneous tracking, whereas the TFT framework offers competitive energy consistency, together with probabilistic forecasting and interpretability via quantile regression and attention mechanisms. These additional capabilities are particularly relevant for decision-support scenarios such as tariff-aware planning, energy procurement, and risk-aware operational scheduling, where uncertainty quantification and transparency are critical. Evaluating on two non-overlapping weeks helps to disentangle model-specific behavior from week-dependent operational variability, which supports the validity of the comparative conclusions.
3.4. Probabilistic Weekly Load Forecasting
Figure 8 illustrates the probabilistic weekly electric load forecasting results obtained with the TFT model for the period of 1–7 December 2024. The median prediction follows the daily operational cycles of the plant, accurately identifying start–stop patterns associated with working hours and inactivity during nights and weekends.
Prediction intervals (P10–P90), computed as described in Section 2.10, exhibit a systematic widening during the onset of production stages and peak-demand periods. This behavior reflects increased operational variability and the nonlinear dynamics of a stage-driven industrial load, indicating that the model appropriately captures uncertainty under changing production conditions.
A quantitative evaluation shows that 86% of the observed hourly load values fall within the P10–P90 interval, suggesting a good probabilistic calibration of the forecasts. While some deviations are observed around sharp peaks, the model consistently captures the timing, duration, and intensity of active production periods, demonstrating robustness in an industrial environment characterized by repetitive yet non-stationary load patterns.
The second evaluation shown in Figure 9, for the later week of 15–22 December 2024, validates the performance of the previously trained and stored TFT model when reloaded for inference. The results confirm that the model retains its forecasting capability without retraining, which is a critical feature for real-world industrial applications where models must be deployed efficiently. As in the first evaluation week, the forecasts align with the actual load curves, capturing the cyclical nature of production activities across weekdays while maintaining near-zero predictions during weekends. The quantile-based prediction bands again highlight areas of higher uncertainty at peak demand periods, suggesting that, while the model can reproduce the general consumption structure, extreme values remain more challenging to estimate with precision. This outcome reinforces the importance of probabilistic forecasting in industrial energy management, as it not only delivers point estimates but also quantifies the reliability of predictions across operational contexts.
To further evaluate the calibration of the probabilistic forecasts, the Prediction Interval Coverage Probability (PICP) and Mean Prediction Interval Width (MPIW) were computed for the P10–P90 interval. The obtained values are reported in Table 5. The empirical coverage (PICP = 0.86) lies slightly above the nominal coverage of 80%, indicating reasonably calibrated, mildly conservative prediction intervals with a moderate uncertainty width, which provide reliable probabilistic information for operational decision-making.
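As a minimal sketch of the two interval metrics, PICP can be computed as the fraction of observations falling inside the P10–P90 band and MPIW as the average band width. These are the standard definitions; the function names and the sample values below are hypothetical and are not the data behind Table 5.

```python
def picp(actual, lower, upper):
    """Prediction Interval Coverage Probability: share of observations inside the band."""
    inside = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return inside / len(actual)


def mpiw(lower, upper):
    """Mean Prediction Interval Width, in the units of the load (kW here)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Illustrative hourly observations with P10 and P90 forecasts (kW).
actual = [5.0, 12.0, 30.0, 41.0, 8.0]
p10 = [3.0, 10.0, 25.0, 30.0, 6.0]
p90 = [7.0, 15.0, 36.0, 40.0, 11.0]

coverage = picp(actual, p10, p90)
width = mpiw(p10, p90)
print(f"PICP = {coverage:.2f}, MPIW = {width:.1f} kW")
```

A PICP above the nominal level of the band (0.80 for P10–P90) points to intervals that are somewhat wider than necessary, while a PICP below it points to overconfident intervals; MPIW indicates how much width is spent to reach that coverage.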
3.5. Residual Analysis of the Forecasting Model
The statistical distribution of the forecasting residuals provides valuable insight into the predictive performance and robustness of the proposed model. As illustrated in Figure 10, the residuals are predominantly concentrated around zero, indicating high accuracy and largely unbiased behavior for the majority of the predictions. The narrow interquartile range reflects a low dispersion of errors under normal operating conditions, while the mean residual, slightly shifted towards positive values, suggests a marginal tendency of the model to overestimate the electrical demand. Nevertheless, the presence of isolated outliers with larger positive and negative deviations highlights the occurrence of atypical operating scenarios or transient load fluctuations that are not fully captured by the model. Despite these extreme values, the overall residual structure confirms the model’s capability to reliably represent the underlying consumption dynamics, supporting its suitability for energy forecasting applications in industrial environments.
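The residual statistics discussed above (mean, interquartile range, and outliers) can be sketched as follows. The sketch assumes that residuals are defined as predicted minus observed load, so that a positive mean corresponds to overestimation, and it flags outliers with the common 1.5 × IQR box-plot rule. Both choices, as well as the function name and the toy values, are assumptions made for illustration.

```python
import statistics


def residual_summary(actual, predicted):
    """Summarize residuals; assumed sign convention: predicted - observed."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    # Box-plot fences (assumed rule): points beyond 1.5 * IQR are outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [r for r in residuals if r < low_fence or r > high_fence]
    return {
        "mean": statistics.fmean(residuals),
        "median": median,
        "iqr": iqr,
        "outliers": outliers,
    }


# Illustrative values in kW: small errors plus one transient peak that is missed.
actual = [10.0] * 9
predicted = [9.0, 9.5, 10.0, 10.0, 10.5, 10.5, 11.0, 11.0, 22.0]

summary = residual_summary(actual, predicted)
print(summary)
```

In this toy case a single large deviation shifts the mean residual well above the median, which shows why a slightly positive mean can coexist with residuals that are tightly concentrated around zero.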
3.6. Model Explainability and Attention Analysis
The explainability analysis provided by the Temporal Fusion Transformer (TFT) reveals the internal mechanisms driving the forecasting process. Variable selection results (Figure 11) indicate that the production stage is the most influential encoder variable, followed by day-of-week and hour-of-day features, while autoregressive dependence on past load values plays a secondary role. This confirms that the electrical demand of the quicklime process is primarily stage-driven, rather than purely autoregressive.
An attention analysis further shows that the model relies on recurrent temporal patterns, rather than immediate past observations. The mean attention distribution (Figure 12) highlights strong contributions from historical windows associated with daily and weekly operational cycles. Horizon-specific attention patterns remain consistent across prediction steps, suggesting stable temporal dependencies, as indicated in Figure 13.
The attention heatmap (Figure 14) reveals clear vertical structures aligned with repeated operational cycles, indicating that the model leverages analogous time periods from previous days to generate forecasts. This behavior confirms that the TFT effectively captures structured industrial dynamics and supports its interpretability in operational contexts.
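One way to read such attention patterns is to average the attention weights over prediction steps and rank the encoder lags by their mean weight. The sketch below illustrates this on synthetic weights only: it does not use any TFT library API, the 48-hour encoder length and the peak positions are invented for the example, and the helper names are hypothetical.

```python
def make_row(n_lags, peaks):
    """Build one normalized synthetic attention row; index 0 is the oldest encoder step."""
    raw = [1.0] * n_lags
    for lag, weight in peaks.items():
        raw[n_lags - lag] = weight  # lag 1 = most recent step, lag n_lags = oldest
    total = sum(raw)
    return [w / total for w in raw]


def mean_attention(rows):
    """Average attention over prediction steps (one row per horizon step)."""
    return [sum(row[k] for row in rows) / len(rows) for k in range(len(rows[0]))]


def top_lags(mean_weights, k=2):
    """Return the k lags (hours back) that receive the highest mean attention."""
    n = len(mean_weights)
    ranked = sorted(range(n), key=lambda i: mean_weights[i], reverse=True)
    return [n - i for i in ranked[:k]]


# Two synthetic horizon steps that both emphasize the same hours of previous days.
rows = [
    make_row(48, {24: 5.0, 48: 3.0}),
    make_row(48, {24: 4.0, 48: 3.5}),
]
dominant = top_lags(mean_attention(rows), k=2)
print("dominant lags (hours):", dominant)
```

If the dominant lags cluster at multiples of 24 h and remain the same across horizon steps, the attention is concentrated on analogous hours of previous days, which is the kind of structure that appears as vertical bands in a heatmap.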
4. Conclusions
This study presented a short-term electric load forecasting framework for a quicklime production process based on real industrial data collected during 2024, integrating operational measurements with a Temporal Fusion Transformer (TFT) deep learning architecture. The dataset captures the actual behavior of the plant under normal operating conditions, including a stage-driven production logic, a plant-level aggregated electrical load, fixed weekday working schedules from 08:00 to 18:00, and non-operational periods during nights and weekends. As a result, the analyzed data reflect realistic industrial characteristics such as intermittent operation, abrupt load transitions, and bounded variability inherent to energy-intensive manufacturing systems.
The proposed TFT model demonstrated a strong capability to learn the temporal structure and operational constraints of the industrial load profiles. The weekly forecasting results show that the model accurately reproduces daily start–stop patterns, peak demand periods, and weekend inactivity. Beyond conventional point-wise accuracy metrics, cumulative energy deviation was explicitly evaluated to assess the preservation of the overall energy balance over planning-relevant horizons. The obtained energy errors of 4.78% over the full week and 5.25% when restricting the evaluation to active production hours confirm that the model reliably captures aggregated plant-level energy demand, even in the presence of nonlinear load transitions and process variability.
The probabilistic formulation based on quantile regression further enhances the practical value of the proposed approach. Prediction intervals were derived from the quantile-based predictive distribution learned by the TFT model, with empirical P10–P90 bands estimated through sampling from the fitted likelihood. The uncertainty intervals systematically widen during the onset of production stages, where load variability and nonlinear dynamics are more pronounced. The empirical coverage of the nominal 80% prediction interval reached 86% (PICP = 0.86), with a mean prediction interval width of 11.2 kW (MPIW), indicating well-calibrated probabilistic forecasts with moderately conservative uncertainty representation under real industrial operating conditions.
A comparative analysis against established deep learning baselines, including LSTM, GRU, and N-BEATS, revealed a meaningful trade-off between instantaneous accuracy and cumulative energy consistency. While recurrent models, particularly GRU, achieved lower MAE and RMSE values and thus better peak-tracking performance, the TFT model maintained competitive global energy consistency across the one-week forecasting horizon (1.90%, compared with 1.45% for N-BEATS, the lowest value among the evaluated models). This behavior highlights the suitability of TFT for planning-oriented industrial applications such as energy procurement, tariff optimization, infrastructure sizing, and production scheduling, where preserving cumulative demand is often more critical than the exact reproduction of short-term peaks.
Overall, the results demonstrate that transformer-based architectures, when applied to real industrial datasets enriched with operational covariates, offer a robust and interpretable solution for short-term energy forecasting in stage-driven manufacturing environments. Future work will focus on extending the analysis to longer operational periods, incorporating additional exogenous variables such as electricity prices and ambient conditions, evaluating and comparing probabilistic forecasting performance across multiple model families to assess uncertainty calibration more comprehensively, and coupling the forecasting framework with optimization or reinforcement learning strategies to enable automated, tariff-aware scheduling and energy-efficient operation in industrial production systems.