1. Introduction
Accurate and robust forecasting of gas consumption has become increasingly critical in the context of modern energy systems, in which supply reliability, market efficiency, and environmental responsibility intersect.
With the growing complexity of consumption behavior, influenced by factors such as urbanization, climate variability, and diversified consumer categories, traditional forecasting approaches often fall short in capturing the underlying dynamics of energy demand.
In response to these challenges, deep neural network (DNN) models have emerged as a promising alternative, offering the capacity to model complex temporal relationships and non-linear dependencies. Architectures such as Seq2Seq with attention, TiDE, and Temporal Fusion Transformers (TFT) enable learning from rich, multi-dimensional datasets that include historical consumption, external variables like temperature, and categorical consumer attributes.
The potential of DNNs extends beyond technical innovation; it holds tangible value in meeting the regulatory and operational demands of the gas sector. Regulators increasingly require energy providers to deliver accurate, transparent, and timely forecasts that support network stability, market transparency, and environmental compliance.
From forecasting aggregated demand for capacity planning to ensuring adherence to balancing responsibilities and emission targets, DNN-based models can provide both the precision and interpretability needed for modern regulatory frameworks.
To address the challenges of large-scale, regulatory-relevant gas consumption forecasting, we selected three deep learning architectures—Seq2SeqPlus, TiDE, and Temporal Fusion Transformer (TFT)—based on their proven ability to handle multivariate time series, support structured inputs, and offer model transparency. Unlike earlier models such as DeepAR, N-BEATS, or Informer, which focus on probabilistic autoregression, pure trend decomposition, or long-range attention alone, the chosen architectures provide a balance of temporal structure modeling, scalability across diverse consumers, and interpretability through attention mechanisms or SHAP-based feature attribution. These characteristics make them especially suitable for real-world deployment under data quality constraints and regulatory requirements.
This study evaluates and compares these three DNN architectures using a large-scale, real-world dataset of over 100,000 consumers, aiming to determine their practical viability for gas consumption forecasting in operational and regulatory settings.
Focus is placed on how data quality, feature attribution and model robustness influence performance—providing actionable insights for real-world deployment.
Mid- and long-term forecasting of natural gas consumption is of crucial importance under the current conditions of Ukraine’s natural gas market as an integral part of the European energy market. In this context, accurate forecasting ensures effective interaction among market participants, helps reduce operational and transaction costs, and mitigates the risks of negative regulatory effects [1].
In particular, the predicted consumption and distribution volumes of natural gas are used by the state regulatory authority—the National Energy and Utilities Regulatory Commission—when setting tariffs for gas distribution system operators and the gas transmission system operator.
Currently, existing methodologies involve using historical data for tariff calculations, with periodic adjustments based on actual data from previous periods (mostly a calendar year). This approach creates discriminatory effects on consumers and leads to cash flow gaps for DSOs.
Additionally, forecasting is essential for natural gas suppliers and traders to balance and reserve capacities. Furthermore, inaccurate forecasting increases suppliers’ costs due to penalties and the need to purchase additional gas volumes on the spot market [2].
Recent advancements in gas consumption forecasting leverage a variety of machine learning (ML) techniques and hybrid models to enhance predictive accuracy. These approaches utilize historical consumption data, meteorological factors, and even social variables to create robust forecasting models. The following sections outline the key methodologies currently employed in the field.
Machine Learning Techniques:
Deep Learning Models: Techniques such as Long Short-Term Memory (LSTM) and Deep Neural Networks (DNN) have shown significant promise. For instance, a DNN incorporating social factors outperformed traditional models in forecasting natural gas consumption in Greece [3].
Hybrid Models: Combining statistical methods with ML, such as Facebook’s Prophet and the Holt–Winters method, has improved accuracy in predicting natural gas demand [4].
Change Point Detection:
Dynamic Adaptation—the integration of change point detection mechanisms allows models to adapt to shifts in consumption patterns, enhancing forecasting reliability in real-time scenarios [5].
Comparative Studies:
Model Evaluation—a comprehensive review of various forecasting strategies highlights the effectiveness of hybrid models and the importance of data decomposition methods in improving prediction accuracy [6].
Fan et al. proposed a deep reinforcement learning framework integrating demand forecasting and dynamic pricing for natural gas pipeline networks, optimizing system performance under physical constraints. Their method highlights the potential of combining deep learning and decision-making for demand response in complex gas systems [7].
While these advanced techniques significantly enhance forecasting capabilities, challenges remain, particularly in data privacy and the interpretability of complex models. Addressing these issues will be crucial for the future development of gas consumption forecasting methodologies.
2. Actual Regulatory Use Cases of Consumption Forecasting
There are several major use cases for proper forecasting of natural gas consumption, engaging all the participants of the gas market—consumers, system operators, suppliers, and the state regulatory authority. Let us consider them in detail. The first one is Forecasting for Tariff Setting. The National Energy and Utilities Regulatory Commission (hereinafter, NEURC) uses gas consumption forecasts to determine fair and economically justified tariffs for gas distribution and transportation.
Tariffs for gas distribution system operators (DSOs) and the gas transmission system operator (TSO) are set for a period of 1 to 5 years based on predicted gas consumption and transmission capacity volumes. In this case, precise long-term forecasts of gas consumption help estimate expected revenues and costs for network operators, ensuring that tariffs cover operational expenses while maintaining affordability for consumers [8,9].
However, there are certain challenges with the current approach, such as historical data dependency, delayed adjustments, and consumer inequality. The current methodology relies on past consumption data to determine tariffs, with periodic adjustments based on actual figures. This can be problematic in times of fluctuating demand.
If actual consumption deviates from forecasts, tariff corrections may take a full calendar year, leading to financial imbalances for operators. In addition, inaccurate forecasting can result in some consumer groups overpaying or underpaying, creating a discriminatory effect on different types of consumers.
The second use case considers Forecasting for Balancing and Capacity Reservation, stemming from the Gas Transmission System Code requirements [10]. Suppliers and network operators use day- and month-ahead gas consumption forecasts to ensure the gas system remains balanced and to optimize capacity reservation. Natural gas supply must match demand in real time to maintain stable system pressure and prevent shortages or excess gas buildup, while suppliers must forecast demand accurately to avoid imbalances, which could lead to financial penalties.
While Capacity Reservation is maintained, gas suppliers must reserve transmission and distribution capacities in advance based on their expected demand. Incorrect forecasts may result in overbooking (leading to unnecessary costs) or underbooking (causing supply shortages and urgent, expensive spot market purchases).
Financial and operational risks may occur in the case of poor forecasting. If suppliers do not accurately predict consumption, they may face penalties for imbalances. Unexpected demand surges require suppliers to purchase additional gas on the spot market, which often has significantly higher prices than long-term contracts. Inaccurate forecasts can create cash flow gaps, where suppliers either hold excess gas they cannot sell or face sudden high costs to meet unexpected demand.
Therefore, accurate gas consumption forecasting is critical for both regulatory tariff-setting and market-based balancing operations. For regulatory authorities and consumers, it ensures stable, fair tariffs that reflect actual market conditions. For suppliers, it minimizes costs, prevents penalties, and improves financial planning. Now, let us review up-to-date forecasting solutions applied to the subject of our research.
Our current study aims to find the best extant solutions to satisfy both short- and long-term forecasting demands. The key consideration for the possible approaches is data, including its availability, robustness, and relevance. The volumes of data available to gas market participants and the state regulatory authority are enormous and constantly growing, although only a small part of them can be used for forecasting purposes (Figure 1).
The TSOUA, as the authorized operator of the Information Platform, possesses large amounts of data on all the consumers of Ukraine—in particular, their territorial location, consumption volumes, and affiliation with DSOs and gas suppliers [11].
In turn, DSOs and suppliers have more granular data, but only for the consumers to whom they provide their services. At the same time, regulatory factors, such as the requirements of regulatory legal acts (for example, the methodology for calculating the volume of gas consumed in the absence of meter readings), and existing technological limitations, such as the low coverage of household consumers with remote-reading meters and the inability of DSO controllers to reach all consumers, cause this data to be incomplete [12].
The NEURC consolidates these data through regulatory reporting submitted by market participants, forming a holistic picture of the regulated market while making the information more general and publicly available. Meanwhile, current legislative restrictions regarding the protection of personal data and commercial concerns may further limit their use for forecasting purposes, which is to be mitigated by amending existing regulations. Consequently, the corresponding models must accommodate this incompleteness of the data, or additional data wrangling and preprocessing solutions must be applied, which can lead to a significant increase in time and computational requirements.
The structure of consumer types and their share in the total consumption volume differs significantly in certain regions of Ukraine—commercial consumers prevail in the East and Center, and household consumers in the West and South.
This may be another significant factor that may influence the choice of forecasting approaches. Commercial consumers demonstrate more stable patterns of consumer behavior with low seasonality, while the patterns of household and utility consumers, on the contrary, have a pronounced seasonality and dependence on environmental temperature. Moreover, within households, separate groups whose consumption patterns are extremely different can be distinguished. This depends primarily on the purpose of consumption (home heating, water heating, cooking), volumes of consumption and housing floor area.
Under the conditions of state regulation of prices and tariffs for household consumers, which is currently applied in Ukraine, the natural gas cost factor has a negligible statistical impact on their consumption level, and therefore we will not use it in this study [13].
Summarizing the above, we can outline the framework of our requirements for predictive approaches—the ability to process large volumes of raw incomplete data, detect and consider seasonal trends, as well as accept additional features—environmental temperature, category, etc.
Moreover, we need a single predictive model that considers all of the more than 100,000 consumers’ unique time series simultaneously, attempting to capture the core patterns that govern the series and thereby mitigating the potential noise that each series might introduce. At the output, the solutions should provide short- and long-term forecasts of consumption volumes for distinct consumers, for consumer categories, and, in general, for certain regions. Since the task is quite complex, in our research we will narrow the scope and focus on the most essential part of it—household consumption forecasting.
3. Materials and Methods
Currently, time-series forecasting problems can be solved using a wide range of approaches. Among them are classical statistical approaches: ARIMA, which is effective for short-term forecasting when consumption data follow a trend and seasonality, and SARIMA, an extension of ARIMA that accounts for seasonal patterns in gas demand. Exponential smoothing (the Holt–Winters method), which is useful for capturing trends and seasonal variations, can also be added to the list, as can regression models: Multiple Linear Regression (MLR), which relates gas consumption to external variables such as temperature, industrial activity, and population, and Generalized Additive Models (GAMs), which are more flexible than linear regression, allowing non-linear relationships between predictors.
These methods have been widely used due to their interpretability and relatively low computational requirements and deliver an acceptable level of precision. A significant limitation of these methods is the inability to train global models that allow considering the behavior of many consumers simultaneously, and, therefore, they are unsuitable for achieving our goals.
In the last decade, due to the impressive development of computing power, completely new approaches have appeared, which are based on machine learning and artificial intelligence.
Support Vector Machines (SVMs), for example, are useful for medium-term forecasting, especially when the relationships between variables are complex; Random Forest and Gradient Boosting Trees (XGBoost 3.0.2, Tianqi Chen, University of Washington; LightGBM 4.6.0, Microsoft Corporation, Redmond, WA, USA) handle non-linear relationships well and can incorporate many external variables; Feedforward Neural Networks (FNNs) are effective for learning complex patterns but require significant data; and Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) are well suited to sequential and time-series forecasting. However, the most cutting-edge solutions are based on the concept of Deep Neural Networks [14,15].
These methods require huge amounts of data and immense computational facilities, but in return they can deliver a higher level of prediction accuracy.
Deep neural network methods in gas consumption forecasting. Let us now turn to the most up-to-date realm of time-series forecasting approaches, deep neural network models: in particular, the Time-series Dense Encoder, the Temporal Fusion Transformer, and Sequence-to-Sequence plus Attention.
The evolution of deep learning models for time series forecasting reflects a gradual shift from sequence modeling roots in natural language processing to highly specialized architectures tailored for temporal data. Sequence-to-Sequence (Seq2Seq) models were initially developed for machine translation and later adapted to time series forecasting. By incorporating the attention mechanism, these models overcame limitations of fixed-length context vectors, allowing them to selectively focus on relevant time steps—an innovation that improved their ability to handle long-term dependencies.
Building on these foundations, the Temporal Fusion Transformer (TFT) was introduced as a purpose-built architecture for interpretable multivariate time series forecasting. It fully integrates attention mechanisms—not only temporally, but also across input features—combined with gating and variable selection layers to enhance both performance and interpretability. TFT represents a significant leap in leveraging attention for structured time-series tasks.
In contrast, TiDE (Time-series Dense Encoder) emerged more recently as a minimalist, fully feedforward alternative to attention-based models. Eschewing traditional sequence-to-sequence and attention architectures, TiDE employs learned temporal embeddings and multi-layer perceptrons (MLPs) to capture time-dependent patterns.
It is designed to be computationally efficient and particularly well-suited for long-horizon forecasting, showing that competitive accuracy can be achieved without explicit recurrence or attention.
Time-series Dense Encoder (TiDE) [16]
The deep learning model TiDE is designed to address the limitations of both linear models and Transformer-based approaches in long-term time-series forecasting [17].
While recent work demonstrated that simple linear models could outperform Transformers on certain benchmarks, linear methods fail to capture non-linear dependencies or leverage covariates effectively. TiDE bridges this gap by introducing a dense Multi-Layer Perceptron (MLP)-based encoder–decoder framework that handles non-linear patterns while maintaining scalability. TiDE encodes the past of a time series along with covariates using dense MLPs and then decodes the encoded time series along with future covariates.
Key Innovations. TiDE employs channel-independent processing, where each $i$-th time series is modeled separately using its past observations $y_{1:L}^{(i)}$, dynamic covariates $x_{1:L+H}^{(i)}$, and static attributes $a^{(i)}$, while sharing global weights across the dataset, i.e.,
$$\hat{y}_{L+1:L+H}^{(i)} = f\left( y_{1:L}^{(i)},\ x_{1:L+H}^{(i)},\ a^{(i)} \right), \quad i \in \mathcal{I},$$
where $\mathcal{I}$ denotes the set of unique entities in a given time-series dataset and $f$ is a forecasting model.
The architecture relies on residual MLP blocks, which consist of a ReLU-activated hidden layer, a linear skip connection, and dropout with layer normalization for stable training.
The model operates in two phases: an encoding phase that compresses historical data and covariates into a low-dimensional latent representation, and a decoding phase that maps this representation to future predictions using projected future covariates. This design ensures efficient non-linear modeling while maintaining interpretability and scalability.
TiDE Architecture. Encoding Stage. Feature Projection. Dynamic covariates $x_t^{(i)}$ of time series $i$ at time $t$ are compressed via a residual block to a reduced dimension $\tilde{r}$:
$$\tilde{x}_t^{(i)} = \mathrm{ResidualBlock}\left( x_t^{(i)} \right).$$
This avoids the high dimensionality of flattened raw covariates.
Dense Encoder. The encoder stacks projected covariates, static attributes, and past observations, processing them through $n_e$ residual blocks:
$$e^{(i)} = \mathrm{Encoder}\left( y_{1:L}^{(i)};\ \tilde{x}_{1:L+H}^{(i)};\ a^{(i)} \right).$$
Decoding Stage. Dense Decoder. The latent representation $e^{(i)}$ is transformed through $n_d$ residual blocks into a matrix $D^{(i)} \in \mathbb{R}^{p \times H}$, where each column $d_t^{(i)}$ corresponds to a future time step:
$$D^{(i)} = \mathrm{Decoder}\left( e^{(i)} \right).$$
Temporal Decoder. For each horizon step $t$, a residual block combines $d_t^{(i)}$ with the projected future covariates $\tilde{x}_{L+t}^{(i)}$:
$$\hat{y}_{L+t}^{(i)} = \mathrm{TemporalDecoder}\left( d_t^{(i)};\ \tilde{x}_{L+t}^{(i)} \right).$$
This “highway” connection ensures direct covariate influence.
Global Residual Connection. A linear projection of the look-back window is added to the predictions, ensuring that the model subsumes linear baselines.
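To make the data flow concrete, the following is a minimal PyTorch sketch of a TiDE-style residual MLP block and dense encoder–decoder with a global residual connection. The class names, layer sizes, and the simplified handling of the temporal decoder (folded here into a single horizon-wide output) are illustrative assumptions, not the reference implementation.
```python
# Minimal sketch of a TiDE-like model; dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ReLU-activated hidden layer, linear skip connection, dropout, and layer normalization."""
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Dropout(dropout),
        )
        self.skip = nn.Linear(in_dim, out_dim)   # linear skip connection
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        return self.norm(self.mlp(x) + self.skip(x))

class TiDELikeModel(nn.Module):
    """Projects covariates, encodes past targets + covariates + static attributes
    with stacked residual blocks, decodes the full horizon, and adds a linear
    projection of the look-back window as a global residual."""
    def __init__(self, lookback=24, horizon=12, cov_dim=2, static_dim=1,
                 proj_dim=4, hidden=128, latent=64):
        super().__init__()
        self.feature_proj = ResidualBlock(cov_dim, hidden, proj_dim)
        enc_in = lookback + (lookback + horizon) * proj_dim + static_dim
        self.encoder = nn.Sequential(
            ResidualBlock(enc_in, hidden, hidden),
            ResidualBlock(hidden, hidden, latent),
        )
        self.decoder = nn.Sequential(
            ResidualBlock(latent, hidden, hidden),
            ResidualBlock(hidden, hidden, horizon),
        )
        self.global_residual = nn.Linear(lookback, horizon)  # linear baseline path

    def forward(self, y_past, covariates, static_attr):
        # y_past: (B, lookback); covariates: (B, lookback+horizon, cov_dim); static_attr: (B, static_dim)
        x_proj = self.feature_proj(covariates)                     # (B, L+H, proj_dim)
        enc_in = torch.cat([y_past, x_proj.flatten(1), static_attr], dim=-1)
        y_hat = self.decoder(self.encoder(enc_in))                 # (B, horizon)
        return y_hat + self.global_residual(y_past)                # global residual connection
```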
Training and Evaluation. TiDE is trained using mini-batch gradient descent with root mean square error (RMSE) loss, where each batch contains multiple time-series segments consisting of look-back windows and their corresponding forecast horizons, allowing overlapping training sequences for comprehensive learning. The model’s performance is evaluated through rolling-window validation, testing all possible consecutive (look-back, horizon) pairs in the test set to thoroughly assess forecasting accuracy. This evaluation approach, consistent with established time-series forecasting practices, can also be applied to a validation set for hyperparameter optimization and model selection.
Temporal Fusion Transformer (TFT) [18]
The Temporal Fusion Transformer (TFT) is designed with specialized components to handle diverse input types (static, known, and observed) for robust time-series forecasting. It adopts quantile regression for multi-horizon forecasting. Each quantile forecast takes the form
$$\hat{y}_i(q, t, \tau) = f_q\left( \tau,\ y_{i, t-k:t},\ z_{i, t-k:t},\ x_{i, t-k:t+\tau},\ s_i \right), \quad i \in \mathcal{I},$$
where $\hat{y}_i(q, t, \tau)$ is the predicted $q$-th quantile of the $\tau$-step-ahead forecast for entity $i$ at time $t$, $f_q(\cdot)$ is a forecasting model, $\tau$ is the forecast horizon, $y_{i, t-k:t}$ are past targets, $z_{i, t-k:t}$ are past unknown observed inputs, $x_{i, t-k:t+\tau}$ are past and future known inputs, $s_i$ are static covariates, and $\mathcal{I}$ is the set of unique entities in a given time-series dataset. The model outputs probabilistic forecasts through quantile predictions (10th, 50th, and 90th percentiles), providing both point estimates and uncertainty quantification.
Key Innovations. The TFT hybrid architecture strategically combines long short-term memory (LSTM) layers, which excel at capturing local temporal patterns, with Transformer components that model long-range dependencies, creating a powerful framework for complex forecasting tasks.
For enhanced interpretability, TFT employs variable selection networks to identify important features and multi-head attention to reveal meaningful temporal relationships like seasonality trends. These innovations collectively enable TFT to handle complex forecasting tasks while maintaining computational efficiency and model transparency.
TFT Architecture. TFT employs gating mechanisms to dynamically adjust network complexity, variable selection networks to prioritize relevant features at each step, and static encoders to integrate time-invariant metadata. For temporal processing, TFT combines a sequence-to-sequence layer for local patterns with an interpretable multi-head attention block for long-range dependencies.
Gating mechanisms. To dynamically assess the importance of features, TFT employs Gated Residual Networks (GRNs). The GRN operates on an input vector $a$ and an optional context vector $c$ through the following formulation:
$$\mathrm{GRN}_{\omega}(a, c) = \mathrm{LayerNorm}\left( a + \mathrm{GLU}_{\omega}(\eta_1) \right),$$
$$\eta_1 = W_{1,\omega}\, \eta_2 + b_{1,\omega},$$
$$\eta_2 = \mathrm{ELU}\left( W_{2,\omega}\, a + W_{3,\omega}\, c + b_{2,\omega} \right),$$
where ELU is the Exponential Linear Unit activation function, $\eta_1, \eta_2 \in \mathbb{R}^{d_{\mathrm{model}}}$ are intermediate layers, LayerNorm is standard layer normalization, and $\omega$ is an index to denote weight sharing [19,20]. Gated Linear Units (GLUs) for an input $\gamma$ take the form
$$\mathrm{GLU}_{\omega}(\gamma) = \sigma\left( W_{4,\omega}\, \gamma + b_{4,\omega} \right) \odot \left( W_{5,\omega}\, \gamma + b_{5,\omega} \right),$$
where $\sigma(\cdot)$ is the sigmoid activation function, $W_{(\cdot)}$ and $b_{(\cdot)}$ are the weights and biases, $\odot$ is the element-wise Hadamard product, and $d_{\mathrm{model}}$ is the hidden state size.
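As an illustration, the following is a minimal PyTorch sketch of the GLU and GRN defined above; the equal input and output dimensions and the class names are simplifying assumptions.
```python
# Minimal sketch of TFT-style gating components; sizes and names are illustrative.
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Gated Linear Unit: sigmoid(W4 x + b4) elementwise-multiplied by (W5 x + b5)."""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, gamma):
        return torch.sigmoid(self.gate(gamma)) * self.value(gamma)

class GRN(nn.Module):
    """Gated Residual Network: LayerNorm(a + GLU(eta1)),
    with eta1 = W1 eta2 + b1 and eta2 = ELU(W2 a + W3 c + b2)."""
    def __init__(self, d_model):
        super().__init__()
        self.w2 = nn.Linear(d_model, d_model)
        self.w3 = nn.Linear(d_model, d_model, bias=False)  # optional context projection
        self.w1 = nn.Linear(d_model, d_model)
        self.glu = GLU(d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, a, c=None):
        eta2 = self.w2(a) if c is None else self.w2(a) + self.w3(c)
        eta2 = nn.functional.elu(eta2)
        eta1 = self.w1(eta2)
        return self.norm(a + self.glu(eta1))       # residual connection with gating
```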
Variable selection networks. TFT determines the importance of each input variable through its variable selection networks, which analyze both static features and time-varying inputs for every prediction. This process reveals which factors most influence forecasts and filters out irrelevant or noisy data that could reduce accuracy.
The model computes variable selection weights by processing both the flattened vector $\Xi_t$ of all past inputs at time $t$ and an external context vector $c_s$ through a GRN, followed by a Softmax normalization:
$$v_{\chi t} = \mathrm{Softmax}\left( \mathrm{GRN}_{v_\chi}\left( \Xi_t, c_s \right) \right),$$
where $v_{\chi t} \in \mathbb{R}^{m_\chi}$ represents the vector of variable importance weights and $c_s$ is the static context vector from the static covariate encoder. For static variables, $c_s$ is excluded since they already contain static information.
Each feature $\xi_t^{(j)}$ undergoes additional non-linear processing at each time step via its own GRN:
$$\tilde{\xi}_t^{(j)} = \mathrm{GRN}_{\tilde{\xi}(j)}\left( \xi_t^{(j)} \right).$$
The processed features are then combined using the selection weights:
$$\tilde{\xi}_t = \sum_{j=1}^{m_\chi} v_{\chi t}^{(j)}\, \tilde{\xi}_t^{(j)},$$
where $v_{\chi t}^{(j)}$ is the $j$-th element of the weight vector $v_{\chi t}$.
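A hedged sketch of this variable selection step is given below, reusing the GRN class from the previous snippet. Unlike the original formulation, in which the weight-producing GRN maps directly to the $m_\chi$ weights, a small linear head is added here for simplicity; the context broadcasting is likewise a simplification.
```python
# Minimal sketch of TFT-style variable selection; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VariableSelection(nn.Module):
    def __init__(self, n_vars, d_model):
        super().__init__()
        self.weight_grn = GRN(n_vars * d_model)            # operates on the flattened inputs
        self.to_weights = nn.Linear(n_vars * d_model, n_vars)
        self.var_grns = nn.ModuleList([GRN(d_model) for _ in range(n_vars)])

    def forward(self, xi, c_s=None):
        # xi: (B, n_vars, d_model) embedded inputs at one time step; c_s: (B, d_model) or None
        flat = xi.flatten(1)
        ctx = None if c_s is None else c_s.repeat(1, xi.shape[1])     # simple context broadcast
        weights = torch.softmax(self.to_weights(self.weight_grn(flat, ctx)), dim=-1)
        processed = torch.stack([g(xi[:, j]) for j, g in enumerate(self.var_grns)], dim=1)
        return (weights.unsqueeze(-1) * processed).sum(dim=1)         # weighted combination, (B, d_model)
```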
Static covariate (attribute feature) encoders. TFT uses dedicated GRN encoders to transform static metadata into context vectors that guide temporal variable selection, local feature processing, and static–temporal fusion in the decoder.
Multi-head attention mechanism. Attention mechanisms scale values based on relationships between keys and queries in the following way:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left( \frac{Q K^{\top}}{\sqrt{d_{\mathrm{attn}}}} \right) V,$$
where $Q$ (queries) represents the current focus of the attention mechanism, $K$ (keys) encodes the content of all time steps, $V$ (values) contains the actual information to aggregate, and $\sqrt{d_{\mathrm{attn}}}$ is a scaling factor for stable gradients.
The multi-head attention approach improves upon standard attention by employing parallel attention heads that each focus on distinct feature representations. The outputs of different heads are then combined via concatenation [21].
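The sketch below implements the scaled dot-product formula above together with a plain multi-head wrapper. Note that TFT's interpretable variant additionally shares the value weights across heads, which is omitted here; all sizes are illustrative assumptions.
```python
# Minimal sketch of scaled dot-product and multi-head attention.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # q: (..., T, d); k, v: (..., S, d)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # scaled similarity of queries and keys
    return torch.softmax(scores, dim=-1) @ v                   # weighted aggregation of values

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, T, _ = q.shape
        def split(x):  # (B, len, d_model) -> (B, heads, len, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        heads = scaled_dot_product_attention(split(self.q_proj(q)),
                                             split(self.k_proj(k)),
                                             split(self.v_proj(v)))
        concat = heads.transpose(1, 2).reshape(B, T, -1)        # concatenate the heads
        return self.out(concat)
```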
The temporal fusion decoder stacks several layers, including a locality-enhancement sequence-to-sequence layer, a static enrichment layer, interpretable multi-head self-attention, and a position-wise feed-forward layer, to learn the temporal relationships present in the data.
Quantile outputs. TFT produces prediction intervals alongside point forecasts by directly outputting multiple percentiles (e.g., 10th, 50th, 90th) at each time step, computed via a linear transformation of the temporal fusion decoder’s output.
Training and Evaluation. TFT is trained using a quantile loss function that jointly optimizes prediction percentiles, enabling probabilistic forecasting with uncertainty intervals. During evaluation, TFT employs rolling-window validation to assess performance across all forecast horizons, while attention weights and variable selection networks provide interpretable insights into feature importance and temporal patterns.
Compared to TiDE’s RMSE-based training, TFT’s quantile loss offers richer uncertainty quantification but requires more computation due to its LSTM and attention components. The model’s performance is measured through horizon-specific metrics including quantile coverage, mean absolute error (MAE) for median predictions, and analysis of attention patterns for temporal dependencies.
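For reference, a minimal sketch of the quantile (pinball) loss that such training jointly optimizes over the 10th, 50th, and 90th percentiles is shown below; the tensor shapes are illustrative assumptions.
```python
# Minimal sketch of the quantile (pinball) loss averaged over several percentiles.
import torch

def quantile_loss(y_true, y_pred, quantiles=(0.1, 0.5, 0.9)):
    # y_true: (B, H); y_pred: (B, H, n_quantiles), one output column per quantile
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[..., i]
        # penalize under-prediction with weight q and over-prediction with weight (1 - q)
        losses.append(torch.max(q * err, (q - 1) * err))
    return torch.mean(torch.stack(losses))
```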
Sequence-to-Sequence Plus Attention (Seq2SeqPlus) [22,23]
Key Innovations. Seq2SeqPlus introduces critical enhancements over the traditional Seq2Seq model, primarily through the attention mechanism, which allows dynamic focus on relevant input tokens during decoding, improving handling of long sequences [24].
The Transformer architecture replaces RNNs (Recurrent Neural Networks) with self-attention, enabling parallel processing and capturing long-range dependencies more effectively.
Seq2SeqPlus Architecture. Seq2SeqPlus improves upon Seq2Seq by integrating attention, Transformer blocks, and hybrid mechanisms. It consists of two main components—Encoder Block and Decoder Block.
Encoder Block. The encoder processes sequential input data to generate contextual annotations. We use a bidirectional RNN (BiRNN) to capture both past and future temporal dependencies within a fixed window. The forward RNN processes the time series $x = (x_1, \ldots, x_T)$ chronologically (from $x_1$ to $x_T$), producing hidden states $\overrightarrow{h}_t$ that encode historical trends:
$$\overrightarrow{h}_t = f\left( \overrightarrow{h}_{t-1},\ x_t \right).$$
The backward RNN processes the time series in reverse (from $x_T$ to $x_1$), generating hidden states $\overleftarrow{h}_t$ to incorporate future context within the window:
$$\overleftarrow{h}_t = f\left( \overleftarrow{h}_{t+1},\ x_t \right).$$
These states are concatenated into an annotation vector $h_t = \left[ \overrightarrow{h}_t;\ \overleftarrow{h}_t \right]$, which summarizes both past and future context around $x_t$. The annotations are later used by the decoder to compute dynamic attention weights.
Decoder Block. The decoder generates future predictions step-by-step, conditioned on both past decoder states and relevant historical patterns identified by the attention mechanism. At each step $t$, the decoder computes the conditional probability of the next value $y_t$ as
$$p\left( y_t \mid y_1, \ldots, y_{t-1}, x \right) = g\left( y_{t-1},\ s_t,\ c_t \right),$$
where $s_t$ is the decoder’s hidden state, $y_{t-1}$ is the previous prediction, and $c_t$ is a time-dependent context vector encoding the attended historical observations.
The decoder state $s_t$ updates recursively using
$$s_t = f\left( s_{t-1},\ y_{t-1},\ c_t \right),$$
where $f$ is a recurrent unit (LSTM). The context vector $c_t$ is a weighted sum of the encoder annotations $h_j$:
$$c_t = \sum_{j=1}^{T} \alpha_{tj}\, h_j,$$
with weights $\alpha_{tj}$ computed via a soft attention mechanism:
$$\alpha_{tj} = \frac{\exp\left( e_{tj} \right)}{\sum_{k=1}^{T} \exp\left( e_{tk} \right)}, \qquad e_{tj} = a\left( s_{t-1},\ h_j \right).$$
Here, $e_{tj}$ scores how well the context window around $h_j$ aligns with the current forecast step $t$, parametrized by a feedforward network $a(\cdot)$.
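A minimal PyTorch sketch of one decoding step with this additive attention mechanism follows; all layer sizes and the single-variable output are illustrative assumptions.
```python
# Minimal sketch of one attention-based decoding step for a univariate forecast.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.score = nn.Sequential(                      # feedforward alignment model a(.)
            nn.Linear(enc_dim + dec_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        self.rnn = nn.LSTMCell(1 + enc_dim, dec_dim)     # consumes y_{t-1} and the context c_t
        self.out = nn.Linear(dec_dim, 1)                 # next consumption value

    def forward(self, annotations, s_prev, cell_prev, y_prev):
        # annotations: (B, T, enc_dim); s_prev, cell_prev: (B, dec_dim); y_prev: (B, 1)
        T = annotations.size(1)
        s_rep = s_prev.unsqueeze(1).expand(-1, T, -1)            # align s_{t-1} with each annotation h_j
        e = self.score(torch.cat([s_rep, annotations], dim=-1))  # alignment scores e_{tj}, (B, T, 1)
        alpha = torch.softmax(e, dim=1)                          # attention weights alpha_{tj}
        context = (alpha * annotations).sum(dim=1)               # context vector c_t, (B, enc_dim)
        s_t, cell_t = self.rnn(torch.cat([y_prev, context], dim=-1), (s_prev, cell_prev))
        return self.out(s_t), s_t, cell_t
```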
Model performance evaluation
In this study, model evaluation focused on well-established error-based performance metrics such as MAE, RMSE, R2, and WAPE, which directly measure the accuracy and reliability of forecasts in practical, interpretable terms [25]. These metrics are widely accepted in both academic and applied forecasting domains—particularly for large-scale deep learning models—due to their robustness and relevance to operational decision-making. We deliberately did not apply traditional statistical significance tests (e.g., p-values for forecasted outcomes), as such methods are generally not standard in performance validation for time-series models, especially when predictions are autocorrelated and models operate on overlapping sequences. Nonetheless, we acknowledge the potential value of statistical testing in post-forecast residual analysis, and future work may integrate runs-based tests, Geary’s test, or two-dimensional bit-sequence analysis to further assess structural forecast reliability and distributional alignment over time [26,27].
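For transparency, a small NumPy sketch of the metrics used in this study is given below; the function name and dictionary output are ours rather than a specific library's API.
```python
# Error metrics used in this study: MAE, RMSE, R2, and WAPE.
import numpy as np

def forecast_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    wape = np.sum(np.abs(err)) / np.sum(np.abs(y_true))   # weighted absolute percentage error
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "WAPE": wape}
```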
4. Results
The primary raw data we used for the research were contained in two original datasets. The first one consisted of 55 columns containing billing data of 105,527 households in the Volyn region of Ukraine on monthly natural gas consumption from January 2019 to April 2023 (52 periods in total) in cubic meters. The dataset columns were as follows:
Column 1 “ID”—household ID;
Column 2 “Category”—household categories prescribed by the Gas Distribution System Code depending on the purpose of consumption (home heating, water heating, cooking), volumes of consumption, and housing floor area [28];
Column 3 “GDS”—household affiliation with a gas distribution station (39 GDSs in total);
Columns 4–55 “Consumption”—Monthly volumes of consumption from January 2019 to April 2023 in cubic meters.
We performed an exploratory data analysis that showed the presence of seasonality and autocorrelation in consumption volumes, as well as a statistically significant correlation between the volume of natural gas consumed and environmental temperature in certain categories of household consumers (categories 4, 5, 6, 7, 10, 11, 12, 14, 15, 16, 17) (Figure 2 and Figure 3; Table 1).
A slight level of correlation and autocorrelation is observed in consumer groups that use natural gas for cooking only (categories 1, 2, 8) or for cooking and/or water heating (categories 3, 9).
The second dataset consisted of 39 rows and 53 columns containing meteorological data on the average monthly air temperature measured at each of the 39 GDSs from January 2019 to April 2023. The dataset columns were as follows (Figure 2 and Figure 3):
Column 1 “GDS”—name of a gas distribution station;
Columns 2–53 “Temperature”—average monthly temperature measured at a gas distribution station.
Both primary datasets were transformed into a single long-format dataset consisting of 5,381,877 rows and five columns for the subsequent model training process.
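A hedged pandas sketch of this wide-to-long transformation is shown below; the file names and the date-label format are placeholders, while the column names mirror the dataset description.
```python
# Sketch of the wide-to-long reshaping of the billing and temperature datasets.
import pandas as pd

billing = pd.read_csv("billing_wide.csv")          # ID, Category, GDS, 52 monthly columns (placeholder file)
weather = pd.read_csv("gds_temperature_wide.csv")  # GDS, 52 monthly temperature columns (placeholder file)

months = [c for c in billing.columns if c not in ("ID", "Category", "GDS")]

consumption = billing.melt(id_vars=["ID", "Category", "GDS"],
                           value_vars=months,
                           var_name="Date", value_name="Consumption")
temperature = weather.melt(id_vars=["GDS"], var_name="Date", value_name="Temperature")

long_df = (consumption
           .merge(temperature, on=["GDS", "Date"], how="left")
           # assumes month columns are labeled like "2019-01"; adjust the format if needed
           .assign(Date=lambda d: pd.to_datetime(d["Date"], format="%Y-%m"))
           [["ID", "Date", "Temperature", "Category", "Consumption"]])
```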
Because the data originate from real billing records, values were missing in certain months for some consumers, totaling 113,014 rows, which had to be addressed before further development.
Owing to our uncertainty about the origin of the NaN values—either there was no gas consumption in particular months or the meter readings were not taken—we applied two approaches, resulting in two different datasets for the models to be trained on.
A simpler approach involves replacing the “NaN” values with “0” values. Given the small number of such values (about 2%), we can assume that this will not have a critical impact on the performance of the models.
A more sophisticated approach involves replacing NaN values using Household-Specific Seasonal Imputation and K-Nearest Neighbors (KNN) techniques [29].
The Household-Specific Seasonal Imputation technique was chosen because it considers seasonality and fills missing values using the average gas consumption for the same month in other years for each consumer individually.
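A minimal sketch of this household-specific seasonal imputation, assuming the long-format dataframe produced above, is as follows.
```python
# Sketch of household-specific seasonal imputation: fill each missing value with that
# household's average consumption for the same calendar month in other years.
import pandas as pd

def seasonal_impute(long_df: pd.DataFrame) -> pd.DataFrame:
    df = long_df.copy()
    df["Month"] = df["Date"].dt.month
    # per-household, per-calendar-month mean; NaN values are ignored by default
    seasonal_mean = df.groupby(["ID", "Month"])["Consumption"].transform("mean")
    df["Consumption"] = df["Consumption"].fillna(seasonal_mean)
    return df.drop(columns="Month")
```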
However, there was an issue: for some households, the values for a given month were missing across several years in the original dataset. The solution was to apply a more complicated technique, KNN, which uses patterns from similar households to estimate missing values and needs a large dataset with strong correlations. For our purposes, KNNImputer was used with the parameters n_neighbors = 7, weights = ‘uniform’, and metric = ‘nan_euclidean’, with fine-tuning performed by masking known values and selecting the n_neighbors value that minimized the imputation error (RMSE).
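The KNN-based fallback can be sketched with scikit-learn's KNNImputer using the parameters reported above; pivoting to one row per household and one column per month is our assumption about how similar households are compared.
```python
# Sketch of KNN-based imputation of remaining gaps with scikit-learn's KNNImputer.
import pandas as pd
from sklearn.impute import KNNImputer

# one row per household, one column per month
wide = long_df.pivot(index="ID", columns="Date", values="Consumption")

imputer = KNNImputer(n_neighbors=7, weights="uniform", metric="nan_euclidean")
wide_imputed = pd.DataFrame(imputer.fit_transform(wide),
                            index=wide.index, columns=wide.columns)

# back to long format for merging with the other features
imputed_long = wide_imputed.stack().rename("Consumption_imputed").reset_index()
```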
Once imputation had been performed, we compared the original and imputed data to check whether the imputed values follow the original distribution. The mean and variance check results are presented in Table 2, confirming that the imputation was successful.
As a result of data wrangling and preprocessing, we obtained two final time-series datasets (with NaN values replaced by 0 and imputed using the special techniques, respectively) containing consumption data with pronounced seasonality, a statistically significant correlation with temperature data, and consumption category data as a supporting attribute (Figure 4). The final datasets’ columns were as follows:
Column 1 “ID”—household ID (Series Identifier);
Column 2 “Date”—month and year (Timestamp);
Column 3 “Temperature”—average monthly temperature measured on a gas distribution station the household is affiliated with (Covariate Feature);
Column 4 “Category”—household category (Attribute Feature);
Column 5 “Consumption”—Monthly volumes of consumption from January 2019 to April 2023 in cubic meters (Target Variable).
They were then ready to be used in the model training process.
Three cutting-edge deep neural network models, Seq2SeqPlus, TiDE, and TFT, were trained on the final datasets with the following parameters (a minimal configuration sketch is given after the list):
Chronological data splitting was applied; the earliest 80% of rows were assigned to training, the next 10% to validation, and the most recent 10% to test.
The forecast horizon was set to 12 months to support year-ahead forecasting, and the context window, which defines the input lags to the model for each time series, was set to 24.
Data granularity—monthly
The RMSE metric was set as an optimization objective.
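The configuration sketch referenced above is as follows; the constants and the row-wise chronological split are a simplified stand-in for the training platform's actual data splitting.
```python
# Sketch of the training configuration: chronological 80/10/10 split and forecasting parameters.
import pandas as pd

FORECAST_HORIZON = 12            # months ahead (year-ahead forecasting)
CONTEXT_WINDOW = 24              # input lags per time series
DATA_GRANULARITY = "monthly"
OPTIMIZATION_OBJECTIVE = "rmse"  # optimization objective used for all three models

def chronological_split(long_df: pd.DataFrame):
    """Earliest 80% of rows for training, next 10% for validation, most recent 10% for test."""
    df = long_df.sort_values("Date")
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    valid = df.iloc[int(0.8 * n): int(0.9 * n)]
    test = df.iloc[int(0.9 * n):]
    return train, valid, test
```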
The compute engine parameters used for model training are as follows: GPU Type—NVIDIA L4 (4 vCPU, 2 Core, 16 GB), Number of GPUs—4, Data disk size—100 GB. The computational facilities were limited to 2 node hours and the number of training epochs to 10 per model.
After the training was performed and the models were tested, we obtained the following results (Table 3).
Model feature attribution is expressed using the SHAP methodology [30].
SHAP (SHapley Additive exPlanations) is a powerful method for interpreting machine learning models, including deep neural networks, by quantifying the contribution of each feature to the model’s predictions. SHAP is based on Shapley values from cooperative game theory, which fairly distribute the prediction outcome among the input features. It explains how much each feature contributes (positively or negatively) to a given prediction.
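As an illustration of the attribution workflow, the following hedged sketch applies the model-agnostic KernelExplainer from the shap library to a placeholder forecaster; the predict function, background sample, and feature layout are assumptions, not the exact pipeline used by the training platform.
```python
# Sketch of SHAP-style feature attribution with a placeholder predict function.
import numpy as np
import shap

rng = np.random.default_rng(0)

def predict_fn(X):
    # placeholder forecaster over [24 consumption lags, temperature, category code]
    return X[:, :-2].mean(axis=1) * 0.8 + X[:, -2] * 0.1

background = rng.normal(size=(50, 26))   # small reference sample of inputs
X_sample = rng.normal(size=(20, 26))     # instances to explain

explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(X_sample)

# mean absolute SHAP value per feature, normalized to percentages
importance = np.abs(shap_values).mean(axis=0)
importance_pct = 100 * importance / importance.sum()
```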
The feature attribution of the trained models is presented in Figure 4.
Model feature attribution tells us how important each feature is when making a prediction. Attribution values are expressed as a percentage; the higher the percentage, the more strongly that feature impacts a prediction on average.
5. Discussion
Considering the results of our research, the following conclusions can be drawn.
A global modeling approach was adopted in this study to leverage common seasonal, temperature-dependent, and structural consumption patterns observed across a large population of households. This decision was driven by the practical need for scalable, centralized forecasting models applicable to regulatory planning across regions. At the same time, we acknowledge the heterogeneity among consumer behaviors—especially across usage categories and geographic areas—and addressed this in part by incorporating category and temperature as model features. While global models efficiently generalize shared patterns, further research may explore cluster-based segmentation or hybrid approaches to better capture subgroup-specific dynamics without sacrificing scalability.
Past consumption is a dominant feature in all trained models, which may confirm their autoregressive origins and the high importance of historical consumption data in our context. These results also demonstrate the strong seasonal and temporal patterns of household consumption.
Meanwhile, temperature and category features are to be considered as supportive due to their moderate influence level—they provide useful fine-tuning rather than the primary signal.
This confirms that our models are specifically designed to handle both autoregressive dependencies (past gas consumption) and exogenous factors (temperature, category, etc.), as well as their ability to learn temperature effects indirectly through past gas consumption.
Nevertheless, the importance of the temperature feature was demonstrated in our recent paper using the Granger causality test, which means our models should benefit from including it explicitly [13].
The consumption category feature is valuable as well because it helps differentiate between groups with strong and weak seasonal behavior. It prevents seasonality bias, ensuring that groups with non-seasonal behavior are still predicted accurately, and interacts with temperature, helping the model decide when temperature matters more or less.
Without this feature, our models would treat all consumers the same, potentially leading to overgeneralization when the model may only learn the dominant pattern (e.g., strong seasonality) and ignore less seasonal groups, or loss of detail when consumers with flat or irregular consumption patterns might not be predicted well.
There are several general trends we observed regarding the models trained on the “Totally Imputed” vs. “Zero-filled” datasets, based on the performance metrics obtained:
- Totally Imputed data consistently improves performance across all models compared to Zero-filled data.
- R2 drops slightly with Zero-filled data, meaning the models explain less variance.
- Seq2Seq suffers the most from missing values, while TiDE is the most stable.
All models show approximately the same levels of accuracy, especially on the processed and imputed data, which may indicate the acceptability of all the described DNN techniques for achieving our goals using real-life data.
When finally choosing a model for use at the deployment stage, one should consider such forecasting parameters as completeness and robustness of the data and forecasting horizon. Of course, further periodic monitoring of accuracy levels and retraining on new data is recommended to maintain the functional parameters of the models.
Model Insights and Suitability. The Seq2SeqPlus model with an attention mechanism demonstrated a strong performance in terms of MAE and RMSE, particularly on the imputed dataset. It effectively captured temporal dependencies and historical consumption trends, leading to reliable short-term forecasts.
However, its MAPE was comparatively high, suggesting that the model was less effective at capturing low-consumption fluctuations, a common issue when prioritizing absolute over percentage-based errors.
While Seq2SeqPlus may require additional feature engineering to generalize across consumption categories with non-seasonal behaviors, it remains a strong candidate for short-term aggregate forecasting, particularly when the input data is complete and well-processed.
TiDE emerged as the most balanced model across all evaluation metrics. It showed robust performance on both datasets, with minimal degradation in the presence of missing data, which points to its stability under real-world conditions. Although it did not outperform the other models in any single metric, its consistency and relatively low variance make it a reliable choice for general-purpose forecasting.
TiDE’s simplicity also offers faster training times and easier hyperparameter tuning, which are advantageous in operational environments that require frequent model retraining.
The TFT model offered the greatest flexibility and interpretability, excelling in handling diverse feature types (static, known, and observed time-varying variables).
While slightly underperforming Seq2SeqPlus on some absolute metrics (MAE, RMSE), its ability to model complex temporal relationships and provide explanatory insight via attention mechanisms and SHAP analysis makes it valuable for long-term forecasting and strategic planning applications.
However, TFT’s complexity and sensitivity to imputation strategies make it more challenging to deploy in volatile or sparse data environments without additional regularization and tuning.
Based on the comparative analysis and the specific forecasting objectives, the models can be recommended for different use cases (Table 4):
For real-world deployment, it is advisable to consider MAE or WAPE as primary optimization objectives, as these metrics align more closely with minimizing total consumption error—critical for operational planning. Furthermore, an ensemble approach combining the strengths of Seq2SeqPlus and TiDE may offer a balance between responsiveness and robustness.
The influence of data preprocessing—particularly the treatment of missing values—proved to be a critical factor in model performance. All three models consistently showed better accuracy on the imputed dataset compared to the version where missing values were simply replaced with zeros. The difference was most apparent in the MAE and RMSE values, which substantially degraded on the NaN-filled data.
This suggests that imputation strategies should be treated as an integral part of model development, not just a preprocessing step.
Given the high seasonality observed in some consumer categories, imputing based on historical seasonal trends and similar profile behavior appears to preserve important patterns that the models can effectively learn from.
For large-scale utility forecasting applications, it is essential to adopt context-aware imputation methods, such as seasonal medians or clustering-based smoothing, to maintain model accuracy while minimizing information loss.
6. Conclusions
The application of deep neural network (DNN) models to gas consumption forecasting represents a significant advancement in aligning predictive analytics with the operational and regulatory needs of modern energy systems. Unlike traditional statistical approaches, DNN-based models—such as Seq2SeqPlus, TiDE, and TFT—offer the flexibility to incorporate complex temporal patterns, consumer heterogeneity, and external variables such as weather or calendar effects.
In regulated energy markets, where forecasting accuracy directly impacts supply planning, tariff setting, and network balancing obligations, the ability of these models to deliver reliable, high-resolution forecasts at scale is particularly valuable.
Moreover, explainable architectures like TFT further enhance transparency, supporting compliance with regulatory requirements for model interpretability and auditability.
Equally important, these models can be tuned to prioritize total consumption accuracy, which aligns well with regulatory metrics such as aggregate demand forecasting accuracy, under-/over-supply penalties, and carbon accounting targets.
As regulators increasingly demand real-time responsiveness, robust handling of incomplete data, and forward-looking risk modeling, DNN-based forecasting systems are poised to become a cornerstone of compliant, data-driven energy operations.
Ultimately, the successful deployment of these models requires more than technical precision—it demands careful integration with data governance, regulatory frameworks, and operational workflows. When implemented thoughtfully, DNN forecasting tools can significantly enhance both regulatory compliance and strategic energy planning in the gas sector.