An Optimized Deep Transformer Framework Using Informer Architecture for Accurate Temperature Forecasting

Noorani, Maryam; Mehrdoust, Farshid; Hamdi, Ilyes; Hamdi, Abdelouahed

doi:10.3390/a19060437

¹ Department of Applied Mathematics, Faculty of Mathematical Sciences, University of Guilan, Rasht P. O. Box 41938-1914, Iran

² CentraleSupélec, Paris-Saclay University, 8-10 rue Joliot Curie, 91190 Gif-sur-Yvette, France

³ Department of Mathematics and Statistics, College of Arts and Sciences, Qatar University, Doha P. O. Box 2713, Qatar

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(6), 437; https://doi.org/10.3390/a19060437

Submission received: 22 April 2026 / Revised: 19 May 2026 / Accepted: 27 May 2026 / Published: 1 June 2026

Abstract

Accurate temperature forecasting is essential for agriculture, disaster management, and infrastructure planning, yet numerical weather prediction models face increasing limitations under climate change. Deep learning architectures, particularly transformers, offer promising alternatives but suffer from quadratic complexity and limited ability to capture both long-range dependencies and local periodic patterns in volatile climate data. To address these challenges, this paper proposes the Hybrid Attention Informer (HA-Informer), a unified end-to-end framework that introduces three modifications to the standard Informer: a hybrid attention mechanism combining ProbSparse attention with depthwise separable convolutions to capture global and local patterns simultaneously, an adaptive distillation mechanism that dynamically adjusts compression based on attention entropy to preserve fine-grained information, and a residual refinement decoder that mitigates error accumulation in long-horizon forecasting. The proposed model is evaluated on hourly temperature data from three climatically diverse cities, Niamey (hot), Tehran (temperate), and Harbin (cold), against seven baselines, namely, LSTM, a CNN, ARIMA, XGBoost, DLinear, transformer, and the standard Informer. The experimental results demonstrate that HA-Informer consistently achieves the lowest forecasting errors across all three locations, with mean squared error reductions of approximately 54% over Informer, 85% over DLinear, and 90% over LSTM on the Niamey dataset, supported by statistically significant Diebold–Mariano test statistics ($p < 0.05$) confirming the superiority of the proposed approach.

Keywords:

deep neural network models; Informer model; temperature; transformer model

1. Introduction

Estimating future temperature is a fundamental component of weather forecasting systems. As climate change accelerates, accurate temperature forecasting becomes increasingly significant for agriculture, disaster management, and infrastructure planning. According to [1], traditional approaches to weather forecasting, particularly temperature estimation, predominantly depend on numerical weather prediction (NWP) models. By solving sophisticated physical and dynamic equations that describe atmospheric motion and its variations, these models employ a complex set of processes to estimate future atmospheric conditions [2,3,4,5,6,7].

Despite their widespread use, NWP models face inherent and growing limitations. Their precision declines as global warming intensifies and climate change progresses [8]. Since the 1980s, while NWP models have seen significant improvements due to advancements in computational capabilities, better modeling methods, and more precise data assimilation [9], they have still suffered from critical drawbacks. These include uncertainties in initial conditions and boundary settings [10], oversimplifications of surface characteristics and their impact on outputs [11], and structural limitations in how closely they represent real-world scenarios [12]. Consequently, researchers have increasingly explored alternative approaches.

In response to these NWP limitations, numerous researchers have explored deep learning methods for predicting and forecasting different meteorological parameters [13,14,15]. Deep learning models have emerged as potential alternatives to address the issues encountered in complex system evaluation and time series prediction [16,17,18,19], offering the ability to unravel intricate, non-linear patterns within data. Due to their capability to utilize past data for forecasting present and future states, recurrent neural networks (RNNs) have been widely applied in various studies to estimate upcoming time-step variables [20,21]. However, as network depth grows, RNNs frequently encounter the issue of gradient vanishing during training. Recent studies have therefore turned to clustering-based approaches that leverage temporal correlations among related sequences to enhance forecasting robustness. Pellicani et al. [22] introduced a real-time anomaly prediction method that clusters correlated cryptocurrencies and uses an online multi-target LSTM to forecast anomalies amid market volatility. Pellicani et al. [23] proposed CARROT, a novel approach that clusters cryptocurrencies by temporal correlation and uses multi-target LSTM per cluster to forecast anomalies, helping maximize profit or minimize losses. A dual-clustering framework on temporal and channel dimensions to enhance multivariate time series forecasting, despite heterogeneous patterns and complex inter-channel correlations, was investigated by Qiu et al. in [24]. However, as the temporal gap widens, long-term dependency challenges emerge, reducing the impact of earlier information on the present time step.

To overcome the long-term dependency problem, transformer models have been introduced as promising solutions [25]. Transformer models, created to capture relationships across various time steps [26,27,28,29], improve the ability to forecast extended sequences. Modern advancements in deep learning have resulted in weather forecasting models that achieve predictive accuracy on par with operational systems used by national meteorological agencies [30,31,32,33]. These models are constructed using ERA5 (the fifth-generation atmospheric reanalysis produced by ECMWF, which provides global climate information from January 1940 to the present) and exhibit forecasting capability even when applied to initial conditions beyond those seen in their training data. Unlike other deep learning approaches that incorporate physical constraints directly into their architectures (e.g., [34]), it remains uncertain whether these models have truly internalized atmospheric processes, such as air motion dynamics and the propagation of disturbances, or whether they are merely identifying statistical patterns that minimize prediction errors for subsequent data points.

In recent years, transformer-based architectures have significantly advanced time series forecasting. Huang et al. [35] proposed a long time-series ocean wave prediction method based on the PatchTST model. Liu et al. [36] proposed iTransformer, a simple yet effective architecture that applies attention and feed-forward networks on the inverted dimensions, with variables as tokens for time series forecasting. Wu et al. [37] introduced TimesNet, a general time series analysis method that models temporal variations by transforming 1D time series into 2D tensors to capture intraperiod and interperiod variations. Lin and Shang [38] incorporated multi-head attention into DLinear to forecast acoustic emission signals for early real-time crack detection in train axles, with the aim of preventing safety accidents. Despite these advances, a critical gap remains. While transformers improve long-range alignment, their self-attention mechanism incurs high computational and memory demands when handling long sequences, restricting practical deployment for extended forecasting. Moreover, it remains uncertain whether these models have truly internalized atmospheric physics or are merely minimizing prediction errors through statistical patterns. If models could be shown to generate physically aligned solutions, they would offer a remarkable opportunity for rapid hypothesis testing.

To overcome the efficiency limitations of standard transformers, the Informer model has been proposed. The Informer introduces a ProbSparse self-attention mechanism that dramatically improves computational and memory efficiency, enabling long time-series prediction in a single forward pass via a generative-style decoder [25]. While the standard Informer model has demonstrated promising results for long-sequence time-series forecasting [25], its direct application to meteorological data, particularly hourly temperature prediction across diverse climate zones, remains limited. Furthermore, the original Informer architecture has three inherent limitations when applied to highly volatile climate data: (i) the ProbSparse attention mechanism, though efficient, may overlook local periodic patterns that are critical for temperature forecasting; (ii) the fixed distillation process applies uniform compression regardless of attention diversity, potentially discarding fine-grained temporal information; and (iii) the decoder lacks explicit error correction mechanisms for long-horizon predictions, leading to accumulated forecasting errors.

To address these limitations, this paper proposes the HA-Informer, which introduces three specific modifications to the original architecture. First, a hybrid attention mechanism is proposed that augments the ProbSparse attention with a parallel learnable local pattern extraction module using depthwise separable convolutions, enabling the model to capture both long-range dependencies and short-term periodic fluctuations simultaneously. Second, an adaptive distillation mechanism is developed that dynamically adjusts feature compression based on attention entropy, preserving fine-grained temporal details when attention patterns are highly dispersed. Third, a residual refinement decoder is designed that incorporates a learnable skip connection to reduce error accumulation in long-horizon forecasting. These three innovations are integrated into a unified end-to-end trainable framework, specifically tailored for hourly temperature prediction across hot, temperate, and cold climate zones. To the best of our knowledge, this work is the first to introduce entropy-guided adaptive distillation and hybrid convolution-attention mechanisms into the Informer architecture for meteorological time-series forecasting.

Consistent with the architectural innovations described above, this study pursues three major objectives, each supported by a dedicated contribution:

(i) Simultaneously capture long-range dependencies and short-term periodic fluctuations in hourly temperature data via a hybrid attention mechanism that enhances ProbSparse attention with a parallel depthwise separable convolution branch, using a learnable mixing parameter to adaptively balance global context and local patterns.
(ii) Prevent the loss of fine-grained temporal information during sequence compression in the encoder through an adaptive distillation mechanism that dynamically adjusts the pooling ratio based on attention entropy, preserving informative details when attention is widely dispersed rather than applying uniform compression.
(iii) Mitigate error accumulation in long-horizon temperature forecasting by introducing a residual refinement decoder with a learnable skip connection, providing an auxiliary gradient pathway that directly corrects attention errors and ensures sublinear error growth relative to forecast length.

These three contributions are not independent add-ons but are jointly optimized within a unified end-to-end framework, which we denote as the HA-Informer. The proposed model was evaluated against seven competitive baselines—LSTM [39], a CNN [40], ARIMA [41], XGBoost [42], DLinear [43], transformer [44], and the standard Informer [25]—using hourly temperature data from three climatically diverse cities: Niamey (hot), Tehran (temperate), and Harbin (cold).

The remainder of this paper is organized as follows. Section 2 describes the deep transformer framework. Section 3 presents the proposed HA-Informer and its three key innovations. Section 4 covers the experimental setup and empirical results, including error comparisons, visual assessments, statistical tests, and computational analysis. Finally, Section 5 concludes the paper.

2. Deep Transformer Framework

The transformer model, as introduced by Vaswani et al. [44], processes sequential data using a self-attention mechanism that enables parallel computation and captures long-range dependencies. In the following, we provide a comprehensive mathematical description of the core components of the transformer, which we adopt as the foundation of our deep Informer framework. Let an input sequence be represented as a real-valued matrix

$$X = [x_{1}, x_{2}, \dots, x_{L}]^{\top} \in \mathbb{R}^{L \times d_{model}},$$

where $L$ is the length of the sequence (number of tokens or time steps) and $d_{model}$ is the embedding dimension of each token. Each row vector $x_{i} \in \mathbb{R}^{d_{model}}$ corresponds to the initial representation of the $i$-th element in the sequence. Because the transformer architecture is permutation-invariant, meaning that without additional information it would treat the sequence as an unordered set, positional encodings are added to the input to retain information about the absolute and relative order of the sequence. The positionally encoded input is given by

$$X_{pos}^{(i)} = x_{i} + p_{i},$$

where $p_{i} \in \mathbb{R}^{d_{model}}$ is the positional vector for position index $i$. A common choice is sinusoidal encoding, which has the advantage of allowing the model to extrapolate to sequence lengths longer than those seen during training. For each dimension index $k$ satisfying $0 \leq k < d_{model}/2$, we define the even and odd components of $p_{i}$ as

$$p_{(i, 2k)} = \sin\left(\frac{i}{T^{2k/d_{model}}}\right), \quad p_{(i, 2k+1)} = \cos\left(\frac{i}{T^{2k/d_{model}}}\right), \tag{1}$$

where $T$ is a constant controlling the maximum wavelength; in the original transformer formulation, $T = 10{,}000$. The intuition is that each dimension of the positional encoding oscillates at a different frequency, allowing the attention mechanism to easily learn to attend to relative positions.
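As an illustration of Equation (1), the following minimal sketch builds the encoding table in plain Python. The function name `sinusoidal_encoding`, the 0-based position index, and the requirement that $d_{model}$ be even are assumptions made for this example only.

```python
import math

def sinusoidal_encoding(length, d_model, T=10000.0):
    """Return a length x d_model table following Equation (1)."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = [[0.0] * d_model for _ in range(length)]
    for i in range(length):                # position index (0-based here)
        for k in range(d_model // 2):      # frequency index, 0 <= k < d_model / 2
            angle = i / (T ** (2 * k / d_model))
            table[i][2 * k] = math.sin(angle)      # even component
            table[i][2 * k + 1] = math.cos(angle)  # odd component
    return table

pe = sinusoidal_encoding(length=4, d_model=6)
```

Each row of `pe` would be added to the corresponding token embedding $x_{i}$; lower values of $k$ oscillate quickly with position, while higher values of $k$ vary slowly.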

Given the position-encoded input, the core mechanism of the transformer is attention, which computes a weighted sum of values based on the similarity between queries and keys. Define three matrices derived from the input: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. In the simplest case of self-attention, all three come from the same input $X_{pos}$, but in general they can be different. The scaled dot-product attention is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V.$$

Here, the dimensions are

$$Q \in \mathbb{R}^{L \times d_{k}}, \quad K \in \mathbb{R}^{L \times d_{k}}, \quad V \in \mathbb{R}^{L \times d_{v}},$$

where $d_{k}$ is the dimension of the query and key vectors, and $d_{v}$ is the dimension of the value vectors. The product $Q K^{\top}$ yields an $L \times L$ matrix of raw attention scores, where entry $(i, j)$ is the dot product between the query at position $i$ and the key at position $j$, indicating how much position $i$ should attend to position $j$. These scores are scaled down by $\sqrt{d_{k}}$ to prevent excessively large values that would push the softmax function into regions of extremely small gradients. The softmax function is applied row-wise to convert the scores into a probability distribution over the $L$ positions. For any matrix $Z$, the softmax is defined elementwise as

$$\mathrm{softmax}(Z)_{ij} = \frac{e^{Z_{ij}}}{\sum_{k=1}^{L} e^{Z_{ik}}}.$$

The resulting $L \times L$ attention weight matrix is then multiplied by $V$ to produce an $L \times d_{v}$ output, where each row is a weighted sum of the value vectors. To enable the model to focus on different representational subspaces simultaneously, multi-head attention is introduced. Instead of performing a single attention operation, we compute $h$ independent attention operations in parallel, each with its own learned projections. The multi-head attention output is defined as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h})\, W^{O},$$

where each attention head is given by

$$\mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}).$$

The learnable projection matrices have the following dimensions:

$$W_{i}^{Q} \in \mathbb{R}^{d_{model} \times d_{k}}, \quad W_{i}^{K} \in \mathbb{R}^{d_{model} \times d_{k}}, \quad W_{i}^{V} \in \mathbb{R}^{d_{model} \times d_{v}}, \quad W^{O} \in \mathbb{R}^{h d_{v} \times d_{model}}.$$

In practice, we typically set $d_{k} = d_{v} = d_{model}/h$ to keep the computational cost constant across different numbers of heads. The output of each head $\mathrm{head}_{i}$ has dimension $L \times d_{v}$; concatenating all $h$ heads yields an $L \times (h d_{v})$ matrix, which is then multiplied by $W^{O}$ to project back to $L \times d_{model}$.
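The shapes involved in scaled dot-product and multi-head self-attention can be checked with a small sketch in plain Python. The helper names (`matmul`, `softmax_rows`, `attention`, `multi_head`, `rand_matrix`), the toy sizes, and the random weights are illustrative assumptions and not part of the model described here.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    out = []
    for row in z:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def multi_head(x, w_q, w_k, w_v, w_o):
    """Self-attention with h heads: concatenate along features, then apply W^O."""
    heads = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)

def rand_matrix(rows, cols, rng):
    """Small random matrix, used only to exercise the shapes."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
L, d_model, h = 5, 8, 2
d_k = d_v = d_model // h                      # d_k = d_v = d_model / h
x = rand_matrix(L, d_model, rng)
w_q = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [rand_matrix(d_model, d_v, rng) for _ in range(h)]
w_o = rand_matrix(h * d_v, d_model, rng)
out = multi_head(x, w_q, w_k, w_v, w_o)       # L x d_model
```

The intermediate score matrix inside `attention` has $L \times L$ entries, which is the source of the quadratic cost in the sequence length.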

The transformer encoder is built by stacking multiple identical layers, each composed of a multi-head attention sublayer followed by a position-wise feed-forward network (FFN). Around each sublayer, a residual connection is used, followed by layer normalization (LayerNorm). Let $X$ be the input to an encoder layer. The first sublayer performs multi-head self-attention where the queries, keys, and values all come from the same input $X$. The intermediate output $X'$ is computed as

$$X' = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X, X, X)).$$

Here, the addition $X + \mathrm{MultiHead}(\dots)$ is the residual connection, which helps with gradient flow during backpropagation. LayerNorm is applied to the sum, normalizing across the feature dimension. The second sublayer is a position-wise feed-forward network, which applies the same fully connected network independently to each position. The final output $Z$ of the encoder layer is

$$Z = \mathrm{LayerNorm}(X' + \mathrm{FFN}(X')).$$

The feed-forward network itself is a two-layer transformation with a ReLU activation in between. For a single position vector $x \in \mathbb{R}^{d_{model}}$ (a row of $X'$), the FFN is defined as

$$\mathrm{FFN}(x) = \max(0, x W_{1} + b_{1})\, W_{2} + b_{2},$$

where $W_{1} \in \mathbb{R}^{d_{model} \times d_{ff}}$, $W_{2} \in \mathbb{R}^{d_{ff} \times d_{model}}$, and $b_{1} \in \mathbb{R}^{d_{ff}}$, $b_{2} \in \mathbb{R}^{d_{model}}$ are learnable parameters. The hidden dimension $d_{ff}$ is typically larger than $d_{model}$ (e.g., $d_{ff} = 4 d_{model}$). The $\max(0, \cdot)$ operation applies the ReLU activation function elementwise.
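The second sublayer, $Z = \mathrm{LayerNorm}(X' + \mathrm{FFN}(X'))$, can be sketched as follows. This is a simplified illustration: the layer normalization omits the learned gain and bias that practical implementations usually include, and the weights are arbitrary toy values chosen for the example.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one position vector across its features (no learned gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def ffn(row, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector."""
    hidden = [max(0.0, sum(x * w for x, w in zip(row, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    return [sum(hv * w for hv, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

def ffn_sublayer(x_prime, w1, b1, w2, b2):
    """Z = LayerNorm(X' + FFN(X')), applied independently to each row of X'."""
    out = []
    for row in x_prime:
        residual = [a + f for a, f in zip(row, ffn(row, w1, b1, w2, b2))]
        out.append(layer_norm(residual))
    return out

# Toy sizes: d_model = 2 and d_ff = 4 * d_model = 8, with arbitrary weights.
d_model, d_ff = 2, 8
w1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[0.05 * (i - j) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model
z = ffn_sublayer([[1.0, -2.0], [0.5, 0.25]], w1, b1, w2, b2)
```

Because the same `ffn` is applied to every row, the cost of this sublayer grows linearly with the sequence length, in contrast to the attention sublayer.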

When a decoder is used in sequence-to-sequence tasks such as machine translation, it resembles the encoder but includes additional components. The decoder consists of three sublayers per layer: a masked multi-head self-attention sublayer, a multi-head encoder–decoder attention sublayer, and a feed-forward network. In the masked self-attention, the attention mechanism is prevented from attending to subsequent positions by applying a mask matrix $M$ to the attention scores before the softmax, where $M_{ij} = -\infty$ for $j > i$ and $M_{ij} = 0$ otherwise. This ensures that the prediction for position $i$ depends only on known outputs at positions less than $i$. The encoder–decoder attention sublayer uses queries from the previous decoder sublayer but keys and values from the encoder output, allowing the decoder to attend to all positions of the input sequence.
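The effect of the mask can be seen in a short sketch that adds $M$ to a matrix of raw scores before a row-wise softmax. The function names and the all-zero score matrix are assumptions chosen to make the resulting weights easy to read.

```python
import math

def causal_mask(length):
    """M[i][j] = -inf for j > i and 0 otherwise."""
    return [[-math.inf if j > i else 0.0 for j in range(length)]
            for i in range(length)]

def masked_softmax_rows(scores, mask):
    """Add the mask to the scores, then apply a row-wise softmax."""
    out = []
    for s_row, m_row in zip(scores, mask):
        z = [s + m for s, m in zip(s_row, m_row)]
        top = max(z)                          # finite, since the diagonal is never masked
        exps = [math.exp(v - top) for v in z]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

# With equal raw scores, each row spreads its weight uniformly over positions j <= i.
weights = masked_softmax_rows([[0.0] * 3 for _ in range(3)], causal_mask(3))
```

Masked entries receive exactly zero weight, so each output row is a weighted sum over the current and earlier positions only.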

A fundamental limitation of the standard transformer is its computational complexity. The self-attention mechanism requires computing the $L \times L$ matrix $Q K^{\top}$ and then the softmax over each of its rows. The total time complexity is

$$\mathrm{Complexity}(\mathrm{Attention}) = O(L^{2} \cdot d_{k}).$$

Since $d_{k}$ is typically much smaller than $L$ for long sequences, the dominant term is $O(L^{2})$. This quadratic scaling with respect to the sequence length $L$ becomes prohibitive for long sequences, such as entire books, high-resolution images treated as sequences of pixels, or long time series. This limitation motivates the development of sparse attention variants, where only a subset of the $L^{2}$ pairwise interactions is computed, or linearized attention variants, where the softmax kernel is approximated to achieve $O(L d_{k}^{2})$ complexity.

3. Hybrid Attention-Based Informer

The Informer model, introduced by Zhou et al. [25], addresses the fundamental challenge of Long Sequence Time-series Forecasting (LSTF), where the goal is to predict an output sequence of length $L_{y}$ given an input sequence of length $L_{x}$, with $L_{y}$ often being very large (e.g., 720 time steps ahead). The standard transformer [44] suffers from quadratic complexity $O(L^{2})$ in the input sequence length $L$, making it prohibitive for LSTF. The Informer overcomes this through three key innovations: the ProbSparse self-attention mechanism, self-attention distillation, and a generative-style decoder. To understand these innovations, we first formalize the input and output representations.

Let an input sequence of multivariate time series be defined as

$$X^{t} = \{x_{1}^{t}, x_{2}^{t}, \dots, x_{L_{x}}^{t} \mid x_{i}^{t} \in \mathbb{R}^{d_{x}}\},$$

where each $x_{i}^{t}$ is a $d_{x}$-dimensional observation at time step $i$ within window $t$. The target output sequence is

$$Y^{t} = \{y_{1}^{t}, y_{2}^{t}, \dots, y_{L_{y}}^{t} \mid y_{i}^{t} \in \mathbb{R}^{d_{y}}\}.$$

Having established the sequence definitions, we now turn to the core mechanism that enables the Informer to achieve efficiency.

The core of the Informer is the ProbSparse attention mechanism. For a standard self-attention layer with query matrix $Q \in \mathbb{R}^{L \times d}$, key matrix $K \in \mathbb{R}^{L \times d}$, and value matrix $V \in \mathbb{R}^{L \times d}$, the attention output is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.$$

This computation requires evaluating all $L^{2}$ query–key pairs, leading to $O(L^{2} d)$ time complexity. The ProbSparse mechanism observes that the attention distribution often exhibits a long-tail property: only a few query–key pairs contribute significantly to the softmax output. Consequently, for a given query $q_{i} \in \mathbb{R}^{d}$ and key set $K$ of size $L_{K}$, the sparsity of the attention distribution is measured using the Kullback–Leibler divergence between the uniform distribution and the true attention distribution. This leads to the max-mean sparsity measure

$$\hat{M}(q_{i}, K) = \max_{j} \left\{\frac{q_{i} k_{j}^{\top}}{\sqrt{d}}\right\} - \frac{1}{L_{K}} \sum_{j=1}^{L_{K}} \frac{q_{i} k_{j}^{\top}}{\sqrt{d}}. \tag{2}$$

This measure is justified by the following bound on the exact sparsity measurement $M(q_{i}, K)$ of [25], in which the maximum in Equation (2) is replaced by the log-sum-exp term $\ln \sum_{j=1}^{L_{K}} e^{q_{i} k_{j}^{\top} / \sqrt{d}}$:

$$\ln L_{K} \leq M(q_{i}, K) \leq \max_{j} \left\{\frac{q_{i} k_{j}^{\top}}{\sqrt{d}}\right\} - \frac{1}{L_{K}} \sum_{j=1}^{L_{K}} \frac{q_{i} k_{j}^{\top}}{\sqrt{d}} + \ln L_{K}.$$

From this bound, we see that queries with larger $\hat{M}(q_{i}, K)$ values correspond to more diverse attention distributions and are more likely to capture dominant dot-product associations. Therefore, only the top $u = \lceil c \cdot \log L \rceil$ queries with the largest $\hat{M}$ values are retained, forming the sparse query matrix $\hat{Q} \in \mathbb{R}^{u \times d}$. The ProbSparse attention is then computed as $\mathrm{Attention}_{Informer}(\hat{Q}, K, V)$. This reduction in the number of queries directly reduces the time complexity from $O(L^{2} d)$ to $O(L \log L \cdot d)$. However, the encoder still needs to handle long sequences efficiently, which brings us to the distillation mechanism.
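The selection rule can be sketched as follows: score every query with Equation (2) and keep the indices of the $u$ largest values. This is an illustration only. It scores every query against every key, so it does not itself realize the $O(L \log L \cdot d)$ cost. The constant `c`, the natural logarithm, the helper names, and the toy matrices are assumptions made for the example.

```python
import math

def sparsity_scores(q, k):
    """Max-mean measure of Equation (2) for every query row."""
    d = len(q[0])
    scores = []
    for q_i in q:
        dots = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        scores.append(max(dots) - sum(dots) / len(dots))
    return scores

def top_u_queries(q, k, c=1.0):
    """Indices of the u = ceil(c * ln L) queries with the largest measure."""
    L = len(q)
    u = min(L, max(1, math.ceil(c * math.log(L))))
    m_hat = sparsity_scores(q, k)
    ranked = sorted(range(L), key=lambda i: m_hat[i], reverse=True)
    return sorted(ranked[:u])

# Toy example with L = 4 queries, L_K = 3 keys and d = 2.
q = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.5], [2.0, 2.0]]
k = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
selected = top_u_queries(q, k)   # u = ceil(ln 4) = 2 queries are retained
```

In this toy case the all-zero query has a measure of zero, since it attends uniformly to every key, and it is therefore never selected.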

The Informer encoder employs a distillation mechanism to progressively reduce the sequence length while retaining dominant features. For the $j$-th encoder layer, the distillation operation from layer $j$ to layer $j + 1$ is defined as [45]

$$X_{j+1}^{t} = \mathrm{MaxPool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([X_{j}^{t}]_{AB}\right)\right)\right),$$

where $[X_{j}^{t}]_{AB}$ denotes the output of the attention block at layer $j$, $\mathrm{Conv1d}(\cdot)$ applies a one-dimensional convolution with kernel size 3, $\mathrm{ELU}(x)$ is the Exponential Linear Unit activation function defined by $\mathrm{ELU}(x) = x$ for $x > 0$ and $\mathrm{ELU}(x) = e^{x} - 1$ for $x \leq 0$, and $\mathrm{MaxPool}(\cdot)$ applies max-pooling with stride 2 and kernel size 2, reducing the sequence length by exactly half. Multiple distillation layers are stacked in this manner, and the feature maps from all layers are concatenated to form the final encoder output. Once the encoder has compressed the input sequence, the decoder must generate the forecast in an efficient manner.
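One distillation step can be sketched in plain Python as a convolution over time, an ELU, and a max-pool. The zero padding at the sequence boundaries, the function names, and the identity kernel used in the example are assumptions, since the padding scheme is not specified here.

```python
import math

def conv1d_same(x, w, b):
    """1D convolution over time with kernel size 3 and one step of zero padding per side.

    x: L x c_in, w: c_out x c_in x 3, b: c_out. Returns L x c_out.
    """
    c_in = len(x[0])
    padded = [[0.0] * c_in] + x + [[0.0] * c_in]
    out = []
    for t in range(len(x)):
        row = []
        for o, kernels in enumerate(w):
            acc = b[o]
            for c in range(c_in):
                for tap in range(3):
                    acc += kernels[c][tap] * padded[t + tap][c]
            row.append(acc)
        out.append(row)
    return out

def elu(v):
    """ELU(x) = x for x > 0 and exp(x) - 1 otherwise."""
    return v if v > 0 else math.exp(v) - 1.0

def max_pool(x, kernel=2, stride=2):
    """Per-channel maximum over windows of `kernel` consecutive time steps."""
    return [[max(col) for col in zip(*x[s:s + kernel])]
            for s in range(0, len(x) - kernel + 1, stride)]

def distill(x, w, b):
    """X_{j+1} = MaxPool(ELU(Conv1d(X_j)))."""
    activated = [[elu(v) for v in row] for row in conv1d_same(x, w, b)]
    return max_pool(activated)

# Single-channel toy input with an identity kernel, so only ELU and pooling act.
x = [[1.0], [-1.0], [3.0], [2.0]]
w = [[[0.0, 1.0, 0.0]]]
b = [0.0]
halved = distill(x, w, b)   # sequence length 4 -> 2
```

With stride 2 and kernel size 2, each output step summarizes two adjacent input steps, which is what halves the sequence length at every layer.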

The Informer decoder is designed for generative inference, producing the entire output sequence in a single forward pass. The decoder input is constructed as

$$X_{de}^{t} = \mathrm{Concat}(X_{token}^{t}, X_{0}^{t}) \in \mathbb{R}^{(L_{token} + L_{y}) \times d_{model}},$$

where $X_{token}^{t} \in \mathbb{R}^{L_{token} \times d_{model}}$ is the starting token sequence derived from the encoder output (typically the last $L_{token}$ time steps), and $X_{0}^{t} \in \mathbb{R}^{L_{y} \times d_{model}}$ is a placeholder sequence initialized to zeros. To ensure causality during generation, a masked multi-head self-attention mechanism sets the attention scores to $-\infty$ for all pairs in which the key position $j$ lies after the query position $i$:

$$\mathrm{MaskedAttn}(Q, K, V) = \mathrm{softmax}\left(\frac{\hat{Q} K^{\top} + M}{\sqrt{d}}\right) V, \tag{3}$$

with the mask matrix $M$ defined by $M_{ij} = 0$ for $j \leq i$ and $M_{ij} = -\infty$ for $j > i$. The final prediction is obtained through a linear projection

$$\hat{Y}_{Informer}^{t} = \mathrm{Linear}(\mathrm{MaskedAttn}(X_{de}^{t})).$$
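The construction of the decoder input can be illustrated with a short sketch. The function name `build_decoder_input` and the toy embedded sequence are assumptions for this example; the sequence passed in stands for whichever $d_{model}$-dimensional representation supplies the start tokens.

```python
def build_decoder_input(embedded, l_token, l_y):
    """Concat(X_token, X_0): the last l_token rows followed by l_y rows of zeros."""
    d_model = len(embedded[0])
    x_token = [row[:] for row in embedded[-l_token:]]   # known context (start tokens)
    x_zero = [[0.0] * d_model for _ in range(l_y)]      # placeholders for the forecast
    return x_token + x_zero

# Toy embedded sequence with 6 time steps and d_model = 2.
seq = [[float(t), float(-t)] for t in range(6)]
x_de = build_decoder_input(seq, l_token=3, l_y=4)       # (3 + 4) x 2
```

Because all $L_{y}$ placeholder rows are present from the start, a single forward pass can fill every forecast position at once instead of generating them step by step.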

Despite these innovations, the original Informer has limitations that motivate further improvement, which leads us to propose the Hybrid Attention-based Informer.

We now present the proposed Hybrid Attention-based Informer. The HA-Informer retains the overall encoder-decoder architecture of the original Informer but replaces or augments three critical components with hybrid mechanisms that explicitly address the limitations of the standard Informer. The first modification introduces a hybrid attention mechanism that combines the global sparse ProbSparse attention with a local convolutional branch. The second replaces the fixed distillation operation with an adaptive distillation mechanism that dynamically adjusts the compression ratio based on the entropy of the attention distribution. The third augments the decoder with a residual refinement network that provides an auxiliary gradient pathway and corrects attention errors. We begin with the hybrid attention mechanism.

The Hybrid Attention mechanism in the HA-Informer is defined as a convex combination of the original ProbSparse attention and a depthwise separable convolutional operation applied to the value matrix:

$$\mathrm{Attention}_{HA}(Q, K, V) = \alpha \cdot \mathrm{softmax}\left(\frac{\hat{Q} K^{\top}}{\sqrt{d}}\right) V + (1 - \alpha) \cdot \mathrm{Conv1D}_{dw}(V). \tag{4}$$

Here, $\alpha \in [0, 1]$ is a learnable scalar parameter initialized to 0.5, allowing the model to automatically balance between global and local feature extraction during training. The term $\mathrm{Conv1D}_{dw}(\cdot)$ denotes a depthwise separable one-dimensional convolution [46] with kernel size $k = 5$. For an input tensor of shape $(L, d_{model})$, the depthwise separable convolution first applies a depthwise convolution, where each input channel is convolved independently with its own kernel, requiring $O(d_{model} \cdot k)$ parameters, followed by a pointwise convolution (a $1 \times 1$ convolution) that mixes information across channels, requiring $O(d_{model}^{2})$ parameters. The total parameter complexity is $O(d_{model} \cdot k + d_{model}^{2})$, compared to $O(d_{model}^{2} \cdot k)$ for a standard convolution. The hybrid attention output retains the same dimensions as the input, and the parameter $\alpha$ is optimized jointly with all other network parameters using gradient descent. Having enhanced the attention mechanism, we next address the distillation process.
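The mixing in Equation (4) and the parameter counts above can be sketched as follows. Several points are assumptions made for illustration: the global branch is passed in as a precomputed $L \times d_{model}$ matrix, zero padding keeps the local branch at length $L$, $\alpha$ is treated as a plain number rather than a trained parameter, and all function names are hypothetical.

```python
def depthwise_separable_conv(v, dw_kernels, pw_weights):
    """Depthwise convolution (one kernel per channel) followed by a pointwise 1x1 step.

    v: L x d, dw_kernels: d x k with k odd, pw_weights: d x d.
    """
    L, d = len(v), len(v[0])
    k = len(dw_kernels[0])
    half = k // 2
    depthwise = []
    for t in range(L):
        row = []
        for c in range(d):
            acc = 0.0
            for tap in range(k):
                src = t + tap - half
                if 0 <= src < L:       # positions outside the sequence count as zero
                    acc += dw_kernels[c][tap] * v[src][c]
            row.append(acc)
        depthwise.append(row)
    # Pointwise step: mix the channels at each time step.
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*pw_weights)]
            for row in depthwise]

def hybrid_attention(global_out, v, alpha, dw_kernels, pw_weights):
    """alpha * global branch + (1 - alpha) * local convolutional branch."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    local_out = depthwise_separable_conv(v, dw_kernels, pw_weights)
    return [[alpha * g + (1.0 - alpha) * l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_out, local_out)]

def parameter_counts(d_model, k):
    """Weights without biases: (depthwise separable, standard convolution)."""
    return d_model * k + d_model ** 2, d_model ** 2 * k

# Toy example: L = 6, d = 2, k = 5, moving-average depthwise kernels.
v = [[float(t), 1.0] for t in range(6)]
global_out = [[0.0, 0.0] for _ in range(6)]
dw = [[0.2] * 5, [0.2] * 5]
pw = [[1.0, 0.0], [0.0, 1.0]]
mixed = hybrid_attention(global_out, v, alpha=0.5, dw_kernels=dw, pw_weights=pw)
```

For example, `parameter_counts(512, 5)` returns 264,704 weights for the separable form against 1,310,720 for a standard convolution, roughly a fivefold reduction at this kernel size.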

The adaptive distillation mechanism in the HA-Informer replaces the fixed max-pooling operation with an entropy-guided adaptive pooling operation. For the attention block output at layer $j$, let $p_{i}$ denote the attention probability assigned to key position $i$, $i = 1, \dots, L_{K}$, so that the $p_{i}$ form a probability distribution over the key positions. The Shannon entropy [47] of the attention distribution is computed as

$$H_{j} = - \sum_{i=1}^{L_{K}} p_{i} \log p_{i}.$$

This entropy measures the dispersion of attention: low entropy indicates that attention is concentrated on a few key positions, while high entropy indicates that attention is widely dispersed across many positions. The entropy is then passed through a sigmoid activation function to produce a gating factor

$$\beta_{j} = \sigma(H_{j}) = \frac{1}{1 + e^{-H_{j}}}.$$

The adaptive pooling ratio is then defined as a function of $\beta_{j}$:

$$r_{adaptive} = a + b \cdot (1 - \beta_{j}) = 0.5 + 0.3 \cdot (1 - \beta_{j}), \tag{5}$$

where $a = 0.5$ and $b = 0.3$ are constants determined by the specific pooling strategy. Following [45], the adaptive distillation operation from layer $j$ to layer $j + 1$ is then

$$X_{j+1}^{t} = \mathrm{AdaptivePool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([X_{j}^{t}]_{AB}\right)\right)\right) \odot \beta_{j}, \tag{6}$$

where $\mathrm{AdaptivePool}(\cdot)$ applies pooling with the dynamically determined ratio $r_{adaptive}$, and $\odot$ denotes element-wise multiplication by the gating factor $\beta_{j}$. The product $\odot\, \beta_{j}$ further scales the pooled features: because $\beta_{j}$ increases with $H_{j}$, features from layers with concentrated (low-entropy) attention are scaled down more strongly than features from layers with highly dispersed attention. After improving the encoder with adaptive distillation, we turn our attention to decoder enhancements.
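The quantities entering Equations (5) and (6) can be computed for a single attention distribution as sketched below. The natural logarithm, the convention that zero-probability terms contribute nothing, and the function names are assumptions for this example. How the entropy is aggregated over queries and heads, and how `AdaptivePool` realizes a given ratio, are not specified here and are therefore not modeled.

```python
import math

def attention_entropy(p):
    """Shannon entropy -sum p_i log p_i of one attention distribution."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

def gate(entropy):
    """beta_j = sigmoid(H_j)."""
    return 1.0 / (1.0 + math.exp(-entropy))

def adaptive_ratio(beta, a=0.5, b=0.3):
    """r_adaptive = a + b * (1 - beta_j), Equation (5)."""
    return a + b * (1.0 - beta)

# Two extreme attention patterns over four key positions.
patterns = {"concentrated": [1.0, 0.0, 0.0, 0.0], "uniform": [0.25] * 4}
summary = {}
for name, p in patterns.items():
    h = attention_entropy(p)
    beta = gate(h)
    summary[name] = (h, beta, adaptive_ratio(beta))
```

With these definitions, the one-hot pattern gives $H = 0$, $\beta = 0.5$, and $r_{adaptive} = 0.65$, while the uniform pattern over four keys gives $H = \ln 4$, $\beta = 0.8$, and $r_{adaptive} = 0.56$.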

The Residual Refinement Decoder in the HA-Informer adds an auxiliary refinement pathway that bypasses the masked attention mechanism. The decoder output is computed as

$$\hat{Y}_{HA}^{t} = \mathrm{Linear}\left(\mathrm{MaskedAttn}(X_{de}^{t}) + \gamma \cdot \mathrm{Refine}(X_{de}^{t})\right),$$

where $\mathrm{Refine}(\cdot)$ is a two-layer position-wise feedforward network with hidden dimension $d_{model}/2$ and ReLU activation [48]

$$\mathrm{Refine}(x) = \max(0, x W_{1} + b_{1})\, W_{2} + b_{2}, \tag{7}$$

with $W_{1} \in \mathbb{R}^{d_{model} \times d_{model}/2}$, $W_{2} \in \mathbb{R}^{d_{model}/2 \times d_{model}}$, $b_{1} \in \mathbb{R}^{d_{model}/2}$, and $b_{2} \in \mathbb{R}^{d_{model}}$. The scalar parameter $\gamma$ is learnable and initialized to 0.1, controlling the contribution of the refinement pathway. The term $\gamma \cdot \mathrm{Refine}(X_{de}^{t})$ adds directly to the masked attention output before the final linear projection, providing a residual connection that allows gradients to flow directly from the loss to the decoder input without passing through the attention mechanism. With all three components defined, we now describe how the entire HA-Informer is trained.
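The refinement pathway can be sketched as a bottleneck feed-forward network whose output is scaled by $\gamma$ and added to the attention output. In this illustration the masked attention output is passed in as a precomputed matrix, the final linear projection is left out, and the weights are toy values; the function names are hypothetical.

```python
def refine(row, w1, b1, w2, b2):
    """Refine(x) = max(0, x W1 + b1) W2 + b2 with hidden width d_model / 2."""
    hidden = [max(0.0, sum(x * w for x, w in zip(row, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    return [sum(hv * w for hv, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

def refined_features(attn_out, x_de, gamma, w1, b1, w2, b2):
    """MaskedAttn(X_de) + gamma * Refine(X_de): the argument of the final Linear layer."""
    return [[a + gamma * r for a, r in zip(a_row, refine(x_row, w1, b1, w2, b2))]
            for a_row, x_row in zip(attn_out, x_de)]

# Toy sizes: d_model = 4, so the hidden width is d_model / 2 = 2.
w1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # d_model x d_model/2
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]       # d_model/2 x d_model
b2 = [0.0, 0.0, 0.0, 0.0]
x_de = [[1.0, 2.0, 3.0, 4.0]]
attn_out = [[0.0, 0.0, 0.0, 0.0]]                        # stand-in for MaskedAttn(X_de)
features = refined_features(attn_out, x_de, gamma=0.1,
                            w1=w1, b1=b1, w2=w2, b2=b2)
```

Setting `gamma` to zero recovers the attention output unchanged, so the refinement branch acts as an additive correction whose strength is controlled by a single scalar.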

The HA-Informer is trained end-to-end using the mean squared error loss function

$$\mathcal{L} = \frac{1}{L_{y}} \sum_{i=1}^{L_{y}} \lVert \hat{y}_{i} - y_{i} \rVert^{2}, \tag{8}$$

where $\hat{y}_{i}$ is the predicted value at position $i$ and $y_{i}$ is the ground truth. The two learnable scalars $\alpha$ and $\gamma$ are optimized jointly with the existing Informer parameters, whereas $\beta_{j}$ is computed from the attention entropy and introduces no additional trainable parameters. Having established the HA-Informer architecture, we now demonstrate its mathematical superiority over the standard Informer, beginning with local pattern capture.

The standard Informer uses only ProbSparse attention, which selects queries based on the sparsity measure $\hat{M}(q_{i}, K)$. For a time series with strong local autocorrelation at lag $τ$ where $τ ≪ L$, the relevant query–key pairs may not be among the top $u = ⌈c \log L⌉$ queries if the local pattern does not produce large $\hat{M}$ values. In contrast, the HA-Informer’s convolutional branch explicitly models local patterns with complexity $O(L \cdot d_{model} \cdot k)$. For any local window of size $k$, the convolutional branch guarantees that local interactions are captured regardless of the sparsity measure, whereas the standard Informer may miss them entirely. Furthermore, the mixing parameter $α$ adapts to the data: for a purely long-range dependent process, $α \to 1$ and the HA-Informer recovers the standard Informer; for a process with strong local structure, $α \to 0$ and the model relies on convolutions. The standard Informer forces a fixed sparsity pattern that cannot adapt to the local-global trade-off. Beyond attention, the distillation process also reveals significant differences between the two models.
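To give a sense of scale for the two quantities in this argument, the sketch below evaluates the number of retained queries $u = ⌈c \log L⌉$ and an operation count proportional to $L \cdot d_{model} \cdot k$. The values $c = 5$ and $k = 5$ come from Algorithm 1; the natural logarithm, the example sequence lengths, and the function names are assumptions made for illustration.

```python
import math

def num_active_queries(seq_len, c=5):
    """u = ceil(c * ln(L)): number of queries kept by ProbSparse attention (natural log assumed)."""
    return math.ceil(c * math.log(seq_len))

def conv_branch_cost(seq_len, d_model, kernel_size=5):
    """Operation count proportional to L * d_model * k for the convolutional branch."""
    return seq_len * d_model * kernel_size

# Illustrative sequence lengths, not taken from the experiments.
for seq_len in (96, 336, 720):
    u = num_active_queries(seq_len)
    print(seq_len, u, f"{u / seq_len:.1%} of queries retained")
```

The share of retained queries shrinks as the sequence grows, which is why a short-lag pattern that does not stand out in the sparsity measure can be skipped by the attention branch, while the convolutional branch visits every position at a cost that is linear in $L$.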

In the standard Informer, the distillation operation applies a fixed compression ratio $r_{fixed} = 0.5$ at every layer, meaning exactly half of the temporal positions are discarded regardless of the information content. Let the attention entropy at layer $j$ be $H_{j}$. The information loss due to pooling can be quantified as

Δ_{Informer} = 0.5 \cdot \mathrm{Var}(X_{j}),

where $\mathrm{Var}(X_{j})$ is the variance of the feature map. When attention is highly dispersed (large $H_{j}$), many temporal positions are relevant, and discarding half of them causes significant information loss. The HA-Informer instead retains a fraction $r_{adaptive} = 0.5 + 0.3 (1 - β_{j})$ of the temporal positions. For large entropy $H_{j}$, $β_{j} = σ(H_{j})$ approaches 1, so $1 - β_{j}$ approaches 0; thus, $r_{adaptive} \approx 0.5$. For small entropy $H_{j}$ (concentrated attention), $β_{j}$ approaches 0, so $1 - β_{j}$ approaches 1; thus, $r_{adaptive} \approx 0.8$. This means the HA-Informer preserves more information (compresses less) when attention is concentrated, which is when the information is most valuable. More precisely, the fraction of positions discarded is

R_{HA} = 1 - r_{adaptive} = 0.5 - 0.3 (1 - β_{j}),

while the standard Informer has $R_{Informer} = 0.5$ always. When attention is highly concentrated ($β_{j} \to 0$), $R_{HA} = 0.5 - 0.3 = 0.2$, meaning only 20% of information is discarded compared to 50% in the standard Informer. This represents a 60% reduction in information loss for concentrated attention patterns, which are precisely the patterns that are most important for accurate forecasting. Finally, we examine the decoder behavior for long-horizon forecasting.
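The arithmetic behind the adaptive ratio can be checked with a short sketch. It evaluates $r_{adaptive} = 0.5 + 0.3 (1 - β_{j})$, the complementary fraction $1 - r_{adaptive}$, and $β_{j} = σ(H_{j})$ as written in Algorithm 1. The function names, the natural-log entropy, and the example distributions are assumptions for illustration.

```python
import math

def attention_entropy(probs):
    """Shannon entropy H = -sum(p * ln p), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sigmoid(h):
    """Logistic function used to map entropy to beta."""
    return 1.0 / (1.0 + math.exp(-h))

def retained_fraction(beta):
    """r_adaptive = 0.5 + 0.3 * (1 - beta)."""
    return 0.5 + 0.3 * (1.0 - beta)

def discarded_fraction(beta):
    """1 - r_adaptive = 0.5 - 0.3 * (1 - beta)."""
    return 1.0 - retained_fraction(beta)

# Limiting values of beta discussed in the text.
for beta in (0.0, 0.5, 1.0):
    print(beta, retained_fraction(beta), discarded_fraction(beta))

# Relative reduction in the discarded fraction at beta = 0 versus the fixed ratio 0.5.
reduction = (0.5 - discarded_fraction(0.0)) / 0.5

# beta obtained from actual attention distributions via the plain sigmoid.
beta_uniform = sigmoid(attention_entropy([0.25, 0.25, 0.25, 0.25]))
beta_one_hot = sigmoid(attention_entropy([1.0, 0.0, 0.0, 0.0]))
```

In this sketch a one-hot distribution has zero entropy and therefore gives `beta_one_hot = 0.5`, so the $β_{j} \to 0$ case corresponds to the formal limit of the ratio formula rather than a value the plain sigmoid of a non-negative entropy reaches. Whether the authors rescale the entropy before the sigmoid is not stated here.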

Let the prediction error at position $i$ for the standard Informer be denoted $ϵ_{i}^{Inf}$. Because the decoder uses zero-initialized placeholders and no corrective mechanism, errors propagate autoregressively. For large $L_{y}$, the cumulative error satisfies

ϵ_{L_{y}}^{Inf} \approx ϵ_{0} \cdot L_{y},

where $ϵ_{0}$ is the per-step error rate. The HA-Informer decoder adds the refinement term $γ \cdot \mathrm{Refine}(X_{de}^{t})$, which provides a direct mapping from the decoder input to the output. Let the approximation error of the refinement network be bounded by $δ$, i.e., $\| \mathrm{Refine}(X_{de}^{t}) - Y_{true} \| \leq δ$. Then the HA-Informer prediction error is bounded by

ϵ_{L_{y}}^{HA} \leq α \cdot ϵ_{L_{y}}^{Inf} + (1 - α) \cdot δ,

where we have absorbed the effect of $γ$ into the effective $α$ for clarity. Since $δ$ is independent of $L_{y}$ (the refinement network operates position-wise and does not propagate errors across time), only the first term of the bound grows with the horizon, and it does so with slope $α ϵ_{0}$ rather than $ϵ_{0}$. Dividing the bound by $ϵ_{L_{y}}^{Inf} \approx ϵ_{0} L_{y}$ gives

\frac{ϵ_{L_{y}}^{HA}}{ϵ_{L_{y}}^{Inf}} \leq α + \frac{(1 - α) δ}{ϵ_{0} L_{y}} \to α \quad \text{as } L_{y} \to \infty,

so for $α < 1$ the bound on the HA-Informer error is asymptotically smaller than the standard Informer error by the factor $α$, and the relative contribution of the refinement error vanishes as the horizon grows. This advantage, which follows directly from the error bound analysis, is most pronounced for long forecasting horizons where $L_{y}$ is large, supporting the claim that the HA-Informer presented in Algorithm 1 outperforms the Informer in the LSTF setting.

Algorithm 1 HA-Informer algorithm

1: Input: Input sequence $X \in \mathbb{R}^{L_{x} \times d_{x}}$, forecast horizon $L_{y}$, number of encoder layers $N$, number of decoder layers $M$, constant $c = 5$, kernel size $k = 5$
2: Output: Predicted sequence $\hat{Y} \in \mathbb{R}^{L_{y} \times d_{y}}$
3: Step 1: Positional Encoding
4: $X_{pos} \leftarrow \mathrm{PositionalEncoding}(X)$ using sinusoidal functions in (1).
5: Step 2: Encoder Processing with Adaptive Distillation
6: for $j = 1$ to $N$ do
7:     Compute ProbSparse attention with sparsity measure for each query $q_{i}$ using (2).
8:     Retain top $u = ⌈c \log L⌉$ queries with largest $\hat{M}$ to form $\hat{Q}$.
9:     Compute attention output using (4).
10:     Apply residual connection and layer normalization: $[X_{j}]_{AB} = \mathrm{LayerNorm}(X_{j} + \mathrm{Attn}_{HA}(X_{j}, X_{j}, X_{j}))$
11:     Compute attention entropy as $H_{j} = - \sum_{i = 1}^{L_{K}} p_{i} \log p_{i}$ and $β_{j} = σ(H_{j}) = \frac{1}{1 + e^{- H_{j}}}$.
12:     Compute adaptive pooling ratio using (5).
13:     Apply adaptive distillation using (6).
14: end for
15: Concatenate feature maps from all encoder layers: $X_{enc} = \mathrm{Concat}(X_{1}, X_{2}, \dots, X_{N})$
16: Step 3: Decoder Input Construction
17: $X_{token} \leftarrow \mathrm{Last}(X_{enc}, L_{token})$ where $L_{token}$ is typically $L_{x} / 2$
18: $X_{0} \leftarrow \mathrm{Zeros}(L_{y}, d_{model})$
19: $X_{de} \leftarrow \mathrm{Concat}(X_{token}, X_{0}) \in \mathbb{R}^{(L_{token} + L_{y}) \times d_{model}}$
20: Step 4: Decoder Processing with Residual Refinement
21: for $j = 1$ to $M$ do
22:     Apply masked ProbSparse self-attention with causal mask $M_{ij} = 0$ for $j \leq i$, $M_{ij} = - \infty$ for $j > i$ using (3).
23:     Apply cross-attention using queries from decoder and keys/values from encoder.
24:     Compute refinement output by (7).
25:     Combine masked attention with residual refinement: $X_{de}^{(j + 1)} = \mathrm{LayerNorm}(X_{de}^{(j)} + \mathrm{MaskedAttn}(X_{de}^{(j)}) + γ \cdot \mathrm{Refine}(X_{de}^{(j)}))$
26: end for
27: Step 5: Output Projection
28: $\hat{Y} \leftarrow \mathrm{Linear}(X_{de}^{(M)}) \in \mathbb{R}^{L_{y} \times d_{y}}$
29: Step 6: Loss Computation (Training only)
30: Compute Mean Squared Error via (8).
31: Update parameters $α$, $γ$, and all network weights via gradient descent and return $\hat{Y}$.

4. Analysis and Results

4.1. Data Preparation

This study explores temperature variations and forecasting models across distinct climatic zones to provide insights into regional and global temperature trends. The selected regions for this analysis are Niamey (Niger), Tehran (Iran), and Harbin (China), chosen to represent three different climate types: hot (Niamey), temperate (Tehran), and cold (Harbin). This diversity enables a comprehensive evaluation of temperature patterns across various ecosystems, enhancing the understanding of how different climatic conditions influence forecasting models. The rationale for selecting these locations is outlined below:

•: Niamey: Frequent temperature spikes may cause certain models to overfit the data.
•: Harbin: Prolonged winters may introduce long-term dependencies that models like RNNs or LSTM can struggle to learn effectively.
•: Tehran: More stable temperature variations enable models to capture trends more easily; however, this regularity may also lead to overfitting, thereby reducing generalization performance.

Specifically, Niamey, located in Niger, experiences a hot desert climate with extreme temperatures and minimal rainfall. Tehran, the capital of Iran, has a semi-arid climate characterized by hot summers and mild winters, influenced by its location between desert plains and mountain ranges. In contrast, Harbin, in northeastern China, features a cold continental climate, with long, harsh winters and short, warm summers. Each of these cities represents a distinct climatic zone, contributing to the diversity of the dataset for analyzing temperature patterns.

In this study, the HA-Informer model is applied as a sequence-to-sequence long-term time-series forecasting framework, where the hourly temperature observations from Niamey, Tehran, and Harbin are first preprocessed into normalized temporal sequences and then used as input tokens to learn long-range dependencies through the ProbSparse self-attention mechanism. The model is trained to capture both short-term fluctuations and long-term seasonal trends in temperature dynamics, and the 80/20 train–test split is strictly preserved to ensure an unbiased evaluation of forecasting performance across all three regions. The temperature data used in this study were obtained from the NASA POWER (Prediction Of Worldwide Energy Resources) Data Access Viewer, which provides publicly accessible and quality-controlled meteorological datasets derived from satellite observations and assimilated models. The data are available at: https://power.larc.nasa.gov/data-access-viewer/, accessed on 1 February 2026.

It is acknowledged that factors such as sensor placement and observational network density can influence forecasting accuracy in real-world deployments. However, the datasets used in this work are standard meteorological reanalysis/observational time series where the measurements are aggregated at city-level stations rather than individual heterogeneous sensor networks, and detailed sub-station metadata is not available within the scope of this study. Therefore, the focus is on evaluating model performance under consistent, publicly available time-series conditions rather than on micro-level sensor placement effects. With respect to site selection, the three cities were chosen specifically to represent diverse climatic regimes, namely, arid (Niamey), temperate continental (Tehran), and cold continental (Harbin) conditions, which allows us to test the robustness of the HA-Informer model across varying seasonal patterns and volatility levels. While coastal regions or sparse observational networks can introduce additional forecasting challenges, this study does not rely on coastal datasets, and all selected locations are inland urban meteorological series with continuous hourly coverage over the specified periods. This controlled selection helps ensure that differences in performance are primarily attributable to climatic variability rather than data sparsity or sensor distribution effects, thereby strengthening the validity of the comparative evaluation.

The dataset consists of 17,545 hourly temperature recordings for each of the three cities: Niamey, Tehran, and Harbin. The data cover a two-year period from 1 February 2024 to 1 February 2026, ensuring sufficient temporal coverage for training deep learning models. For each city, 80% of the data was used for training, and 20% was reserved for testing to evaluate model performance. Descriptive statistics of the temperature data for all three cities, namely, mean, standard deviation, minimum, maximum, and quartiles, are presented in Table 1.
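The reported sample count can be reproduced from the stated date range. The sketch below counts hourly timestamps between the two dates and derives the 80/20 split sizes. It assumes that both endpoints fall at 00:00 and are included, and that the split is chronological (first 80% for training); neither detail is stated explicitly in the text.

```python
from datetime import datetime, timedelta

start = datetime(2024, 2, 1, 0)
end = datetime(2026, 2, 1, 0)

# Hourly timestamps from start to end, both endpoints included (assumption).
n_hours = int((end - start) / timedelta(hours=1)) + 1

# Chronological 80/20 split (assumption), using integer arithmetic to avoid rounding issues.
n_train = n_hours * 8 // 10
n_test = n_hours - n_train

print(n_hours, n_train, n_test)
```

The count of 17,545 matches a single hourly series over this period, which includes 29 February 2024, and this suggests the figure refers to one city's series rather than to the three cities combined.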

4.2. Model Architectures

The main model employed in this research is the HA-Informer, a transformer-based architecture implemented using PyTorch version 2.1.0. After splitting the dataset, several preprocessing steps were applied, including normalization using the StandardScaler method, which scaled the data to have a mean of zero and a standard deviation of one. In this study, we adopted a multi-step forecasting horizon of $L_{y} = 24$ (predicting the next 24 h). Moreover, the model’s hyperparameters were optimized through iterative experimentation with various configurations based on the grid search method. The optimal configuration, determined through this process, is summarized in Tables 2–4.

To ensure a fair and meaningful comparison, all baseline models, namely, LSTM, a CNN, ARIMA, XGBoost, DLinear, the standard transformer, and the standard Informer, were subjected to the same rigorous hyperparameter optimization procedure as the proposed HA-Informer model. Specifically, for each baseline and each city, we performed a grid search over a carefully designed hyperparameter space, guided by validation set performance. For LSTM, we tuned the number of hidden units (64 to 256), the number of layers (1 to 3), the dropout rate (0.1 to 0.5), and the learning rate (1 × 10⁻⁵ to 1 × 10⁻³). For the CNN, we optimized the kernel size (3, 5, 7), the number of filters (32–128), and the number of convolutional layers (1–3). For ARIMA, we searched over the autoregressive order p (0 to 7), the differencing order d (0 to 2), and the moving average order q (0 to 7) using the Akaike Information Criterion (AIC). For XGBoost, we searched over n_estimators (100 to 500), max_depth (3 to 10), the learning rate (0.01 to 0.3), and the subsample ratio (0.6 to 1.0). For DLinear, we tuned the kernel size for the moving average decomposition (3 to 25), the learning rate (1 × 10⁻⁵ to 1 × 10⁻³), and the batch size (8 to 32). For the standard transformer, we tuned the number of encoder/decoder layers (2 to 4), model dimensions (128 to 512), the number of attention heads (4 to 8), feed-forward dimensions (512 to 2048), and the learning rate (1 × 10⁻⁵ to 1 × 10⁻³). For the standard Informer, we followed the grid search ranges, including model dimensions (256 to 512), encoder/decoder layers (2 to 3), the number of heads (7 to 8), batch size (10 to 12), and the learning rate (3 × 10⁻⁴ to 1 × 10⁻³). All baselines were trained and evaluated on the identical 80/20 train–test splits with the same random seed to ensure reproducibility.
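The grid search procedure described above can be sketched generically as follows. The grid points are illustrative values chosen inside the ranges reported for LSTM; the exact grids used in the study are not given, and `dummy_validation_mse` is a hypothetical placeholder for training a model and measuring its validation error.

```python
from itertools import product

# Illustrative grid points inside the reported LSTM ranges (assumed, not the authors' grid).
lstm_grid = {
    "hidden_units": [64, 128, 256],
    "num_layers": [1, 2, 3],
    "dropout": [0.1, 0.3, 0.5],
    "learning_rate": [1e-5, 1e-4, 1e-3],
}

def grid_search(grid, score_fn):
    """Evaluate every combination and return the configuration with the lowest score."""
    names = list(grid)
    best_config, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def dummy_validation_mse(config):
    """Placeholder objective; a real run would train the model and return its validation MSE."""
    return (abs(config["hidden_units"] - 128) / 128
            + abs(config["dropout"] - 0.3)
            + 0.01 * config["num_layers"])

best_config, best_score = grid_search(lstm_grid, dummy_validation_mse)
n_configs = len(list(product(*lstm_grid.values())))
```

Even this small grid yields 81 configurations per model and city, which gives a sense of the tuning cost that applying the same procedure to all baselines implies.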

4.3. Empirical Results

The three selected cities inherently present different levels of data-related challenges. Among them, Harbin serves as an example of data limitations due to its extremely cold continental climate, where prolonged sub-freezing temperatures can introduce sensor stability concerns in raw ground observations and create highly volatile temporal patterns that challenge forecasting models. Similarly, Niamey’s hot arid climate poses risks of sensor saturation during extreme heat events, while Tehran’s semi-arid conditions offer relatively stable data characteristics. By deliberately including Harbin, a location where data limitations and climatic complexity are most pronounced, this study evaluates the HA-Informer model under challenging conditions that approximate real-world forecasting difficulties.

As stated, Harbin exhibits the highest frequency of missing observations due to sensor stability challenges under extreme cold conditions, making it a representative case for evaluating model robustness under real-world data limitations. The linear interpolation applied here partially mitigates this issue. Let the raw hourly temperature time series for each location be denoted as

\tilde{X} = \{\tilde{x}_{1}, \tilde{x}_{2}, \dots, \tilde{x}_{T}\}, \quad \tilde{x}_{t} \in \mathbb{R}.

(9)

To ensure temporal consistency, missing observations are first identified. For any missing value at time step $t$, linear interpolation is applied as

\tilde{x}_{t} = \tilde{x}_{t_{1}} + \frac{t - t_{1}}{t_{2} - t_{1}} (\tilde{x}_{t_{2}} - \tilde{x}_{t_{1}}),

(10)

where $t_{1} < t < t_{2}$ are the nearest previous and next observed time indices. To mitigate the influence of outliers, a local moving average is computed

\bar{x}_{t} = \frac{1}{2 k + 1} \sum_{i = - k}^{k} \tilde{x}_{t + i},

(11)

and any observation that satisfies

| \tilde{x}_{t} - \bar{x}_{t} | > δ

(12)

is considered an outlier and replaced by $\bar{x}_{t}$, where $δ$ is a predefined threshold. Subsequently, min–max normalization is applied to obtain the processed sequence

x_{t} = \frac{\tilde{x}_{t} - \min(\tilde{X})}{\max(\tilde{X}) - \min(\tilde{X})}, \quad x_{t} \in [0, 1].

(13)
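The preprocessing steps in Equations (10)–(13) can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' code: the function names, the window half-width `k`, the threshold `delta`, and the toy series are assumptions, and the moving-average window is simply truncated at the series boundaries, which Equation (11) does not specify.

```python
import numpy as np

def interpolate_missing(x):
    """Fill NaN entries by linear interpolation between the nearest observed neighbours (Eq. 10)."""
    x = np.asarray(x, dtype=float).copy()
    idx = np.arange(len(x))
    missing = np.isnan(x)
    # np.interp holds the edge value constant if a gap touches either end of the series.
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return x

def replace_outliers(x, k=2, delta=5.0):
    """Replace points deviating more than delta from a centred moving average (Eqs. 11-12)."""
    x = np.asarray(x, dtype=float)
    cleaned = x.copy()
    for t in range(len(x)):
        lo, hi = max(0, t - k), min(len(x), t + k + 1)  # window truncated at the edges
        local_mean = x[lo:hi].mean()
        if abs(x[t] - local_mean) > delta:
            cleaned[t] = local_mean
    return cleaned

def min_max_normalize(x):
    """Scale the series to the interval [0, 1] (Eq. 13)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Toy series with one missing value and one spike (hypothetical data).
raw = [20.0, np.nan, 22.0, 23.0, 60.0, 24.0, 25.0]
filled = interpolate_missing(raw)
cleaned = replace_outliers(filled, k=2, delta=12.0)
scaled = min_max_normalize(cleaned)
```

Because the moving average includes the spike itself, neighbouring points are also pulled toward it; a threshold that is too small would flag those neighbours as outliers as well, so `delta` needs to be chosen with the window size in mind.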

Following preprocessing, the normalized sequence is transformed into supervised samples using a sliding window approach. The input-output pairs are defined as

X^{t} = \{x_{t - L_{x} + 1}, x_{t - L_{x} + 2}, \dots, x_{t}\} \in \mathbb{R}^{L_{x} \times d_{x}},

(14)

Y^{t} = \{x_{t + 1}, x_{t + 2}, \dots, x_{t + L_{y}}\} \in \mathbb{R}^{L_{y} \times d_{y}},

(15)

where for univariate temperature data $d_{x} = d_{y} = 1$, and $L_{x}$, $L_{y}$ denote the input and prediction lengths, respectively. The constructed sequences $(X^{t}, Y^{t})$ are fed into the HA-Informer model, which follows an encoder-decoder architecture. The encoder maps the input sequence into a latent representation

Z^{t} = \mathrm{Encoder}(X^{t}),

(16)

and the decoder generates the predicted output sequence

\hat{Y}^{t} = \mathrm{Decoder}(Z^{t}).

(17)

The model is trained by minimizing a prediction loss function over the training set

\mathcal{L} = \frac{1}{N} \sum_{t = 1}^{N} \| Y^{t} - \hat{Y}^{t} \|_{2}^{2},

(18)

where $N$ is the number of training samples. Thus, the overall pipeline can be expressed as a mapping

\tilde{X} \overset{\text{preprocessing}}{\to} X \overset{\text{windowing}}{\to} (X^{t}, Y^{t}) \overset{\text{Informer}}{\to} \hat{Y}^{t},

(19)

followed by evaluation on the test set (20%) to assess generalization performance.
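The sliding-window construction in Equations (14) and (15) can be sketched as below. The function name and the toy series are hypothetical, and the small window lengths are chosen only to keep the example readable; the study itself uses a horizon of 24 h.

```python
import numpy as np

def make_windows(series, input_len, horizon):
    """Build (X^t, Y^t) pairs: `input_len` past values mapped to the next `horizon` values."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - input_len - horizon + 1
    inputs = np.stack([series[i:i + input_len] for i in range(n_samples)])
    targets = np.stack([series[i + input_len:i + input_len + horizon]
                        for i in range(n_samples)])
    # Trailing axis of size 1 reflects the univariate case d_x = d_y = 1.
    return inputs[..., None], targets[..., None]

series = np.arange(10, dtype=float)  # toy stand-in for a normalized temperature series
X, Y = make_windows(series, input_len=4, horizon=2)
print(X.shape, Y.shape)
```

Each target window starts immediately after its input window, so consecutive samples overlap heavily. When the split is chronological, windows near the boundary should be built separately for the training and test portions to avoid sharing observations between them.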

Table 5 presents a quantitative comparison of forecasting accuracy across all models and three cities. The HA-Informer model consistently achieves the lowest error values in terms of MSE, MAE, and RMSE across Niamey, Tehran, and Harbin, followed by the standard Informer as the second-best model. For instance, in Niamey, HA-Informer achieves an MSE of 0.0006, compared to 0.0013 for Informer, 0.0039 for DLinear, and 0.0058 for LSTM, corresponding to improvements of approximately 54%, 85%, and 90%, respectively. Among baseline models, DLinear ranks third due to its effective linear decomposition, while LSTM and XGBoost show moderate performance. The CNN consistently exhibits the weakest performance across all cities (e.g., MSE of 0.0675 in Harbin), due to its limited capability in modeling long-range temporal dependencies in hourly temperature data.
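The quoted percentage improvements for Niamey follow from the relative MSE reduction, (baseline − model) / baseline. A short check using only the MSE values stated above (the helper name is hypothetical):

```python
def relative_reduction(baseline_mse, model_mse):
    """Fractional reduction in MSE relative to a baseline."""
    return (baseline_mse - model_mse) / baseline_mse

ha_informer_mse = 0.0006  # Niamey, as reported in the text
baselines = {"Informer": 0.0013, "DLinear": 0.0039, "LSTM": 0.0058}

reductions = {name: relative_reduction(mse, ha_informer_mse)
              for name, mse in baselines.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, these reproduce the approximately 54%, 85%, and 90% figures quoted for Informer, DLinear, and LSTM.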

The margin of improvement of HA-Informer over other models is slightly reduced in Harbin compared to Niamey and Tehran, likely due to Harbin’s more complex and highly variable cold climate patterns, which increase prediction difficulty for all models. Nevertheless, the proposed model maintains its superiority across all locations, demonstrating its robustness and effectiveness for temperature forecasting in diverse climatic conditions. These results indicate that combining hybrid attention with adaptive distillation and residual refinement enhances forecasting accuracy compared to both conventional and state-of-the-art deep learning approaches.

The visual results further support these findings. In Figure 1, the predicted temperature curves generated by the HA-Informer model closely follow the ground-truth observations across all locations, with minimal deviation even during periods of rapid temperature change. This demonstrates the model’s ability to capture both short-term fluctuations and long-term trends.

The scatter diagram presented in Figure 2 illustrates the predictive performance of the HA-Informer model for temperature forecasting on the test dataset across three diverse climatic locations: Niamey, Tehran, and Harbin. Each subplot compares the model’s predicted temperatures against the actual observed values, with points closely clustered around the diagonal line indicating high accuracy. Niamey shows an exceptionally strong fit with an $R^{2}$ of 0.977, suggesting that the model captures nearly all variance in temperature for this hot, arid region. Tehran, with an $R^{2}$ of 0.955, exhibits slightly more scatter yet still demonstrates excellent predictive capability, reflecting the model’s robustness in a continental climate with wider seasonal swings. Harbin achieves an $R^{2}$ of 0.971, indicating a very high degree of alignment despite the challenges posed by its cold, harsh winters. Overall, the consistently high $R^{2}$ values across these distinct climatic zones confirm that the HA-Informer model generalizes well geographically, offering reliable temperature predictions from arid to cold continental conditions.

Figure 3 presents the box plot of prediction errors for the proposed HA-Informer model on the test set. The error distribution exhibits a narrow interquartile range (IQR), indicating stable and consistent performance across the test samples. The mean and median errors are close to zero, showing negligible systematic bias. The limited number of outliers confirms the robustness of the HA-Informer against extreme prediction errors. These results demonstrate that the proposed model not only achieves high accuracy but also maintains reliable and stable predictions across different time intervals.

Table 6 presents the Diebold–Mariano (DM) test results, using the test introduced by [49]. These results provide strong statistical evidence that the proposed HA-Informer significantly outperforms all baseline models across three distinct cities: Niamey, Tehran, and Harbin. The negative DM statistics throughout the table uniformly favor HA-Informer, indicating that its forecasting errors are consistently smaller than those of the comparison models. For the Niamey dataset, HA-Informer demonstrates substantial improvements over the original Informer ($DM = -2.34$, $p = 0.019$) and shows even more pronounced advantages against DLinear ($DM = -3.68$, $p = 0.0002$), LSTM ($DM = -4.12$, $p = 3.8 \times 10^{-5}$), XGBoost ($DM = -3.89$, $p = 0.0001$), transformer ($DM = -3.45$, $p = 0.0006$), the CNN ($DM = -5.21$, $p = 1.9 \times 10^{-7}$), and ARIMA ($DM = -4.95$, $p = 3.7 \times 10^{-7}$). The Tehran results follow a similar pattern, with HA-Informer outperforming Informer at the margin of significance ($DM = -1.96$, $p = 0.050$) while delivering highly significant improvements over all other models, particularly the CNN ($DM = -5.67$, $p = 1.4 \times 10^{-8}$) and ARIMA ($DM = -5.33$, $p = 4.8 \times 10^{-8}$). The Harbin dataset yields the most compelling results, where HA-Informer surpasses Informer with $DM = -2.18$ ($p = 0.029$) and achieves exceptionally large test statistics against the CNN ($DM = -5.89$, $p = 3.8 \times 10^{-9}$) and ARIMA ($DM = -5.56$, $p = 1.4 \times 10^{-8}$). Notably, across all three cities, the p-values for comparisons against the CNN and ARIMA are consistently below $1 \times 10^{-6}$, indicating highly significant differences in predictive accuracy. The consistently negative DM statistics across every comparison and every city provide empirical evidence that the hybrid attention mechanism, adaptive distillation, and residual refinement decoder collectively yield statistically superior forecasting performance compared to both classical time series methods and state-of-the-art deep learning architectures.
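For readers who want to reproduce this kind of comparison, the sketch below computes a simplified Diebold–Mariano statistic on squared-error loss differentials and converts a statistic to a p-value under the standard normal approximation. It is a sketch under stated assumptions: the variance estimate omits the autocorrelation correction normally used for multi-step forecasts, the two-sided convention is assumed rather than taken from Table 6, and the function names and toy error values are hypothetical.

```python
import math

def dm_statistic(errors_a, errors_b):
    """Simplified DM statistic on squared-error differentials (no autocorrelation correction).

    A negative value means model A has the smaller squared errors on average.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

def two_sided_p_value(dm):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(dm) / math.sqrt(2.0))

# Toy forecast errors (hypothetical): model A is more accurate than model B.
errors_a = [0.1, -0.2, 0.1, 0.0]
errors_b = [0.5, -0.4, 0.6, 0.3]
dm_toy = dm_statistic(errors_a, errors_b)

# p-values for two statistics quoted in the text.
p_niamey_informer = two_sided_p_value(-2.34)
p_tehran_informer = two_sided_p_value(-1.96)
```

Under this convention a statistic of −1.96 sits almost exactly at the 5% threshold, which matches the description of the Tehran comparison with Informer as being at the margin of significance.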

Table 7 presents a computational cost comparison of eight forecasting models applied to the Tehran region, reporting both training time (in minutes) and inference time (in milliseconds per batch with a batch size). Among all models, ARIMA is the fastest to train by a substantial margin, requiring only 0.91234 min, but its inference time of 1.52381 ms/batch is relatively moderate. In contrast, XGBoost offers the most efficient inference at just 0.39155 ms/batch while maintaining a low training cost of 4.52001 min, making it highly suitable for real-time forecasting applications. DLinear and the CNN also demonstrate excellent inference efficiency with 0.50762 ms/batch and 0.61347 ms/batch, respectively, alongside moderate training times of 8.24036 and 12.31158 min. LSTM requires 18.50021 min for training and achieves an inference time of 0.86422 ms/batch, placing it in the mid-range for both metrics. The attention-based models (transformer, Informer, and HA-Informer) are the most computationally expensive, with training times of 42.70316, 28.42019, and 35.83309 min, respectively, and inference times of 2.84512, 2.10344, and 2.27365 ms/batch, all higher than those of the tree-based and lightweight deep learning alternatives. Overall, the results indicate that for forecasting tasks in the Tehran region, XGBoost, DLinear, and the CNN provide the best balance of low training cost and fast inference, while transformer-based models are considerably more resource-intensive without offering inference-time advantages. A key limitation of this study is therefore that, while HA-Informer achieved high accuracy, its computational cost exceeds that of simpler models and even that of the standard Informer.

5. Conclusions

This paper proposed the Hybrid Attention Informer (HA-Informer), a unified end-to-end framework that introduces three novel modifications to the standard Informer architecture for accurate hourly temperature forecasting across diverse climate zones. The proposed hybrid attention mechanism combining ProbSparse attention with depthwise separable convolutions successfully captured both long-range dependencies and local periodic patterns, while the adaptive distillation mechanism dynamically preserved fine-grained temporal information based on attention entropy, and the residual refinement decoder effectively mitigated error accumulation in long-horizon predictions. Extensive experiments on hourly temperature data from three climatically diverse cities, Niamey, Tehran, and Harbin, demonstrated that HA-Informer consistently outperformed seven competitive baselines, namely, LSTM, the CNN, ARIMA, XGBoost, DLinear, transformer, and the standard Informer. Specifically, HA-Informer achieved reductions in mean squared error of approximately 54% over Informer, 85% over DLinear, and 90% over LSTM in the Niamey dataset, with similarly substantial improvements across all three locations. Diebold–Mariano tests supported the statistical significance of these improvements ($p \leq 0.05$ for all reported comparisons, with the Tehran comparison against Informer at $p = 0.050$), with negative DM statistics uniformly favoring HA-Informer and p-values below 1 × 10⁻⁶ for comparisons against the CNN and ARIMA. However, a key limitation is that while HA-Informer achieves superior accuracy, its computational cost (35.83 min training time for Tehran) exceeds that of simpler models such as XGBoost (4.52 min) and DLinear (8.24 min), making it more suitable for applications where prediction accuracy is prioritized over real-time inference efficiency.

Author Contributions

Conceptualization, A.H. and F.M.; methodology, F.M.; software, M.N. and I.H.; validation, I.H., A.H. and F.M.; formal analysis, I.H. and M.N.; investigation, M.N. and I.H.; resources, A.H. and F.M.; data curation, M.N. and I.H.; writing—original draft preparation, M.N.; writing—review and editing, A.H., F.M., M.N. and I.H.; visualization, M.N.; supervision, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Al-Yahyai, S.; Charabi, Y.; Gastli, A. Review of the use of numerical weather prediction (NWP) models for wind energy assessment. Renew. Sustain. Energy Rev. 2010, 14, 3192–3198.
Bochenek, B.; Ustrnul, Z. Machine learning in weather prediction and climate analyses—applications and perspectives. Atmosphere 2022, 13, 180. [Google Scholar] [CrossRef]
Grundstrom, M.; Tang, L.; Hallquist, M.; Nguyen, H.; Chen, D.; Pleijel, H. Influence of atmospheric circulation patterns on urban air quality during the winter. Atmos. Pollut. Res. 2015, 6, 278–285.
Molteni, F.; Buizza, R.; Palmer, T.N.; Petroliagis, T. The ECMWF ensemble prediction system: Methodology and validation. Q. J. R. Meteorol. Soc. 1996, 122, 73–119.
Stensrud, D.J. Parameterization Schemes: Keys to Understanding Numerical Weather Prediction Models; Cambridge University Press: Cambridge, UK, 2007.
Ullah, W.; Wang, G.; Lou, D.; Ullah, S.; Bhatti, A.S.; Ullah, S.; Karim, A.; Hagan, D.F.T.; Ali, G. Large-scale atmospheric circulation patterns associated with extreme monsoon precipitation in Pakistan during 1981–2018. Atmos. Res. 2021, 253, 105489.
Wilson, D.R.; Ballard, S.P. A microphysically based precipitation scheme for the UK Meteorological Office Unified Model. Q. J. R. Meteorol. Soc. 1999, 125, 1607–1636.
Bauer, P.; Thorpe, A.; Brunet, G. The quiet revolution of numerical weather prediction. Nature 2015, 525, 47–55.
Lorenc, A.C. Analysis methods for numerical weather prediction. Q. J. R. Meteorol. Soc. 1986, 112, 1177–1194.
Warner, T.T.; Peterson, R.A.; Treadon, R.E. A tutorial on lateral boundary conditions as a basic and potentially serious limitation to regional numerical weather prediction. Bull. Am. Meteorol. Soc. 1997, 78, 2599–2618.
Lindstrom, P.C.; Fisher, D.E.; Pedersen, C.O. Impact of Surface Characteristics on Radiant Panel Output. Master’s Thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1997.
Lynch, P. The Emergence of Numerical Weather Prediction: Richardson’s Dream; Cambridge University Press: Cambridge, UK, 2006.
Che, C.; Tian, J. Understanding the Interrelation Between Temperature and Meteorological Factors: A Case Study of Szeged Using Machine Learning Techniques. J. Comput. Technol. Appl. Math. 2024, 1, 47–52.
Liu, H.; Xie, R.; Qin, H.; Li, Y. Research on dangerous flight weather prediction based on machine learning. J. Phys. Conf. Ser. 2024, 2870, 012020.
Price, I.; Sanchez-Gonzalez, A.; Alet, F.; Andersson, T.R.; El-Kadi, A.; Masters, D.; Ewalds, T.; Stott, J.; Mohamed, S.; Battaglia, P.; et al. Probabilistic weather forecasting with machine learning. Nature 2025, 637, 84–90.
Kong, X.; Chen, Z.; Liu, W.; Ning, K.; Zhang, L.; Muhammad Marier, S.; Liu, Y.; Chen, Y.; Xia, F. Deep learning for time series forecasting: A survey. Int. J. Mach. Learn. Cybern. 2025, 16, 5079–5112.
Noorani, I.; Mehrdoust, F. Parameter estimation of uncertain differential equation by implementing an optimized artificial neural network. Chaos Solitons Fractals 2022, 165, 112769.
Mehrdoust, F.; Noorani, I.; Belhaouari, S.B. Forecasting Nordic electricity spot price using deep learning networks. Neural Comput. Appl. 2023, 35, 19169–19185.
Mehrdoust, F.; Noorani, M. Prediction of cryptocurrency prices by deep learning models: A case study for Bitcoin and Ethereum. Int. J. Financ. Eng. 2023, 10, 2350032.
Al-Selwi, S.M.; Hassan, M.F.; Abdulkadir, S.J.; Muneer, A.; Sumiea, E.H.; Alqushaibi, A.; Ragab, M.G. RNN-LSTM: From applications to modeling techniques and beyond-Systematic review. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102068.
Waqas, M.; Humphries, U.W. A critical review of RNN and LSTM variants in hydrological time series predictions. MethodsX 2024, 13, 102946.
Pellicani, A.; Pio, G.; Džeroski, S.; Ceci, M. Real-Time Anomaly Prediction from Cryptocurrency Time Series. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2023; pp. 553–561.
Pellicani, A.; Pio, G.; Ceci, M. CARROT: Simultaneous prediction of anomalies from groups of correlated cryptocurrency trends. Expert Syst. Appl. 2025, 260, 125457.
Qiu, X.; Wu, X.; Lin, Y.; Guo, C.; Hu, J.; Yang, B. Duet: Dual clustering enhanced multivariate time series forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, Toronto, ON, Canada, 3–7 August 2025; pp. 1185–1196.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
Grigsby, J.; Wang, Z.; Nguyen, N.; Qi, Y. Long-range transformers for dynamic spatiotemporal forecasting. arXiv 2021, arXiv:2109.12218.
Jun, J.; Kim, H.K. Informer-based temperature prediction using observed and numerical weather prediction data. Sensors 2023, 23, 7047.
Liang, Y.; Wen, H.; Nie, Y.; Jiang, Y.; Jin, M.; Song, D.; Pan, S.; Wen, Q. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6555–6565.
Sitapure, N.; Kwon, J.S.I. Exploring the potential of time-series transformers for process modeling and control in chemical systems: An inevitable paradigm shift? Chem. Eng. Res. Des. 2023, 194, 461–477.
Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature 2023, 619, 533–538.
Bouallègue, Z.B.; Clare, M.C.; Magnusson, L.; Gascon, E.; Maier-Gerber, M.; Janoušek, M.; Rodwell, M.; Pinault, F.; Dramsch, J.S.; Lang, S.T.K.; et al. The rise of data-driven weather forecasting: A first statistical assessment of machine learning-based weather forecasts in an operational-like context. Bull. Am. Meteorol. Soc. 2024, 105, E864–E883.
Kurth, T.; Subramanian, S.; Harrington, P.; Pathak, J.; Mardani, M.; Hall, D.; Miele, A.; Kashinath, K.; An kumar, A. Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, Davos, Switzerland, 26–28 June 2023; pp. 1–11. [Google Scholar]
Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W.; et al. Learning skillful medium-range global weather forecasting. Science 2023, 382, 1416–1421. [Google Scholar] [CrossRef]
Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett. 2021, 126, 098302. [Google Scholar] [CrossRef]
Huang, X.; Tang, J.; Shen, Y. Long time series of ocean wave prediction based on PatchTST model. Ocean Eng. 2024, 301, 117572. [Google Scholar] [CrossRef]
Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. itransformer: Inverted transformers are effective for time series forecasting. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Volume 2024, pp. 11116–11140. [Google Scholar]
Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv 2022, arXiv:2210.02186. [Google Scholar]
Lin, L.; Shang, X. Time series forecasting of train axle fatigue crack acoustic emission signals by integrating multi-head attention mechanism into DLinear model. Appl. Acoust. 2025, 240, 110922. [Google Scholar] [CrossRef]
Hochreiter, S. Long Short-Term Memory; Neural Computation MIT-Press: La Jolla, CA, USA, 1997. [Google Scholar]
Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
Box, G.; Jenkins, G.M. Time Series Analysis: Forecasting and Control; Holden-Day: San Francisco, CA, USA, 1976; Volume 10. [Google Scholar]
Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? Proc. AAAI Conf. Artif. Intell. 2023, 37, 11121–11128. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv 2015, arXiv:1511.07289. [Google Scholar]
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
Shannon, C.E. A mathematical theory of communications. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
Diebold, F.X.; Mariano, R.S. Comparing predictive accuracy. J. Bus. Econ. Stat. 2002, 20, 134–144. [Google Scholar] [CrossRef]

Figure 1. Temperature forecasting on the test dataset using the HA-Informer model: Niamey (a), Tehran (b), and Harbin (c).

Figure 2. Scatter diagram of temperature predictions from the HA-Informer model for the test dataset in three locations: Niamey (a), Tehran (b), and Harbin (c).

Figure 3. Box plot of prediction errors generated by HA-Informer for (a) Niamey, (b) Tehran, and (c) Harbin.

Table 1. Descriptive statistics of temperature for the three cities.

Statistic	Niamey	Tehran	Harbin
Count	17,545	17,545	17,545
Mean	28.97	17.56	4.49
Standard deviation	6.36	11.21	15.70
Min	11.07	−8.34	−37.28
Max	44.92	42.42	33.62
First quartile (Q1)	24.71	8.35	−8.91
Second quartile (median)	28.69	17.47	6.29
Third quartile (Q3)	33.19	26.29	18.38
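
The quantities in Table 1 are standard summary statistics. As a minimal sketch, they could be computed with the Python standard library as shown below. The sketch assumes the sample standard deviation (n − 1 denominator) and linearly interpolated quartiles, neither of which is stated in the text; the function name `describe` and the toy readings are hypothetical and are not data from the study.

```python
import statistics


def describe(values):
    """Return the summary statistics reported in Table 1 for one series."""
    # "inclusive" treats the data as a full population and interpolates
    # linearly between order statistics (an assumption, see the lead-in).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
    }


# Hypothetical toy readings, not data from the paper.
toy = [21.0, 23.5, 25.0, 27.5, 30.0]
summary = describe(toy)
print(summary)
```

Applying `describe` to each city's hourly series would yield one column of the table.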

Table 2. HA-Informer model hyperparameters for Niamey.

Hyperparameter	Value
Model dimension	512
Encoder layers	3
Decoder layers	2
Batch size	32
Number of heads	8
Attention	ProbSparse
Loss function	MSE
Activation function	GELU
Optimizer	Adam
Epochs	12
Learning rate	$10^{-4}$

Table 3. HA-Informer model hyperparameters for Tehran.

Hyperparameter	Value
Model dimension	512
Encoder layers	3
Decoder layers	2
Batch size	32
Number of heads	8
Attention	ProbSparse
Loss function	MSE
Activation function	GELU
Optimizer	Adam
Epochs	10
Learning rate	$2 \times 10^{-4}$

Table 4. HA-Informer model hyperparameters for Harbin.

Hyperparameter	Value
Model dimension	512
Encoder layers	3
Decoder layers	2
Batch size	32
Number of heads	7
Attention	ProbSparse
Loss function	MSE
Activation function	GELU
Optimizer	Adam
Epochs	12
Learning rate	$4 \times 10^{-4}$
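
Tables 2–4 share most settings and differ only in a few entries. One way to make this explicit is to hold the values in a single configuration object with per-city overrides. The sketch below is illustrative only: the class name `HAInformerConfig`, its field names, and the helper `differing_fields` are hypothetical and do not come from the authors' code; the values are copied from the three tables.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HAInformerConfig:
    """Hyperparameters listed in Tables 2-4 (field names are illustrative)."""
    d_model: int = 512
    encoder_layers: int = 3
    decoder_layers: int = 2
    batch_size: int = 32
    n_heads: int = 8
    attention: str = "ProbSparse"
    loss: str = "mse"
    activation: str = "GELU"
    optimizer: str = "Adam"
    epochs: int = 12
    learning_rate: float = 1e-4


# Defaults follow Table 2 (Niamey); overrides follow Tables 3 and 4.
CONFIGS = {
    "Niamey": HAInformerConfig(),
    "Tehran": HAInformerConfig(epochs=10, learning_rate=2e-4),
    "Harbin": HAInformerConfig(n_heads=7, learning_rate=4e-4),
}


def differing_fields(a, b):
    """Names of the fields whose values differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


print(differing_fields(CONFIGS["Niamey"], CONFIGS["Tehran"]))
print(differing_fields(CONFIGS["Niamey"], CONFIGS["Harbin"]))
```

Relative to the Niamey configuration, Tehran changes the number of epochs and the learning rate, and Harbin changes the number of heads and the learning rate.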

Table 5. Comparison of temperature forecasting accuracy on the test datasets for Niamey, Tehran, and Harbin.

Model	Niamey MSE	Niamey MAE	Niamey RMSE	Tehran MSE	Tehran MAE	Tehran RMSE	Harbin MSE	Harbin MAE	Harbin RMSE
LSTM	0.0058	0.0605	0.0762	0.0078	0.0745	0.0883	0.0088	0.0820	0.0938
CNN	0.0170	0.1080	0.1304	0.0485	0.1800	0.2202	0.0675	0.2210	0.2598
ARIMA	0.0278	0.0865	0.1667	0.0223	0.0665	0.1493	0.0147	0.0305	0.1212
XGBoost	0.0090	0.0695	0.0949	0.0106	0.0765	0.1030	0.0100	0.0735	0.1000
DLinear	0.0039	0.0425	0.0625	0.0041	0.0515	0.0640	0.0046	0.0555	0.0678
Transformer	0.0072	0.0645	0.0849	0.0088	0.0715	0.0938	0.0082	0.0690	0.0906
Informer	0.0013	0.0285	0.0361	0.0021	0.0415	0.0458	0.0019	0.0355	0.0436
HA-Informer	0.0006	0.0185	0.0245	0.0015	0.0349	0.0384	0.0011	0.0286	0.0328
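
The three error measures in Table 5 and the relative improvement of HA-Informer over the standard Informer can be written out as a short sketch. The function names below are hypothetical, and the reductions are computed from the rounded MSE entries of Table 5, so they are approximate.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# MSE values copied from Table 5: (Informer, HA-Informer).
TABLE5_MSE = {
    "Niamey": (0.0013, 0.0006),
    "Tehran": (0.0021, 0.0015),
    "Harbin": (0.0019, 0.0011),
}

reductions = {
    city: relative_reduction(informer, ha_informer)
    for city, (informer, ha_informer) in TABLE5_MSE.items()
}
for city, value in reductions.items():
    print(f"{city}: {value:.1%}")
```

With the rounded table entries, the MSE of HA-Informer is roughly 54% (Niamey), 29% (Tehran), and 42% (Harbin) below that of the standard Informer. The RMSE columns agree with the square roots of the MSE columns, for example $\sqrt{0.0006} \approx 0.0245$ for HA-Informer in Niamey.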

Table 6. Diebold–Mariano test results comparing HA-Informer against other models.

City	Comparison (HA-Informer vs.)	DM Statistic	p-Value
Niamey	Informer	−2.34	0.019
	DLinear	−3.68	0.0002
	LSTM	−4.12	3.8 × 10⁻⁵
	XGBoost	−3.89	0.0001
	Transformer	−3.45	0.0006
	CNN	−5.21	1.9 × 10⁻⁷
	ARIMA	−4.95	3.7 × 10⁻⁷
Tehran	Informer	−1.96	0.050
	DLinear	−3.12	0.002
	LSTM	−3.85	0.0001
	XGBoost	−3.43	0.0006
	Transformer	−3.21	0.001
	CNN	−5.67	1.4 × 10⁻⁸
	ARIMA	−5.33	4.8 × 10⁻⁸
Harbin	Informer	−2.18	0.029
	DLinear	−3.45	0.0006
	LSTM	−4.01	6.1 × 10⁻⁵
	XGBoost	−3.67	0.0002
	Transformer	−3.52	0.0004
	CNN	−5.89	3.8 × 10⁻⁹
	ARIMA	−5.56	1.4 × 10⁻⁸
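
For orientation, a basic form of the Diebold–Mariano statistic reported in Table 6 can be sketched as follows. This is not the authors' implementation. It assumes a squared-error loss, a one-step forecast horizon (so that no autocovariance correction is applied), a standard normal reference distribution, and the sign convention that a negative statistic favours the first model; none of these choices is stated in the table. The function names and the toy error series are hypothetical.

```python
import math


def dm_statistic(errors_a, errors_b):
    """Diebold-Mariano statistic for squared-error loss at horizon 1.

    A negative value means model A has the lower average loss.
    """
    # Loss differential d_t = e_A(t)^2 - e_B(t)^2.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n
    # Variance of the loss differential (lag-0 autocovariance only).
    gamma0 = sum((x - d_bar) ** 2 for x in d) / n
    return d_bar / math.sqrt(gamma0 / n)


def two_sided_p_value(dm):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(dm) / math.sqrt(2.0))


# Hypothetical forecast errors, not data from the paper.
errors_a = [0.1, -0.2, 0.1, 0.0]
errors_b = [0.3, -0.4, 0.2, 0.1]
dm = dm_statistic(errors_a, errors_b)
print(dm, two_sided_p_value(dm))
```

Under this normal approximation, a statistic of −2.34 corresponds to a two-sided p-value of about 0.019 and a statistic of −1.96 to about 0.050, which matches the Informer rows for Niamey and Tehran.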

Table 7. Computational cost comparison of different forecasting models applied to the Tehran region.

Model	Train Time (min)	Inference Time (ms/batch)
ARIMA	0.91234	1.52381
LSTM	18.50021	0.86422
CNN	12.31158	0.61347
DLinear	8.24036	0.50762
XGBoost	4.52001	0.39155
Transformer	42.70316	2.84512
Informer	28.42019	2.10344
HA-Informer	35.83309	2.27365
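
The relative cost of HA-Informer can be read off Table 7 with simple arithmetic. The sketch below copies three rows of the table; the helper name `overhead` is hypothetical.

```python
# (training time in minutes, inference time in ms per batch), from Table 7.
TABLE7 = {
    "Transformer": (42.70316, 2.84512),
    "Informer": (28.42019, 2.10344),
    "HA-Informer": (35.83309, 2.27365),
}


def overhead(model, reference, table=TABLE7):
    """Relative change of `model` over `reference` as (train, inference)."""
    return tuple(m / r - 1.0 for m, r in zip(table[model], table[reference]))


vs_informer = overhead("HA-Informer", "Informer")
vs_transformer = overhead("HA-Informer", "Transformer")
print(f"vs Informer:    train {vs_informer[0]:+.1%}, inference {vs_informer[1]:+.1%}")
print(f"vs Transformer: train {vs_transformer[0]:+.1%}, inference {vs_transformer[1]:+.1%}")
```

On these figures, HA-Informer takes roughly 26% longer to train and about 8% longer per inference batch than the standard Informer, while remaining roughly 16% and 20% below the Transformer on the same two measures.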

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Noorani, M.; Mehrdoust, F.; Hamdi, I.; Hamdi, A. An Optimized Deep Transformer Framework Using Informer Architecture for Accurate Temperature Forecasting. Algorithms 2026, 19, 437. https://doi.org/10.3390/a19060437

