1. Introduction
Estimating future temperature is a fundamental component of weather forecasting systems. As climate change accelerates, accurate temperature forecasting becomes increasingly significant for agriculture, disaster management, and infrastructure planning. According to [1], traditional approaches to weather forecasting, particularly temperature estimation, predominantly depend on numerical weather prediction (NWP) models. These models estimate future atmospheric conditions by solving the physical and dynamic equations that describe atmospheric motion and its variations [2,3,4,5,6,7].
Despite their widespread use, NWP models face inherent and growing limitations. Their precision declines as global warming intensifies and climate change progresses [
8]. Since the 1980s, while NWP models have seen significant improvements due to advancements in computational capabilities, better modeling methods, and more precise data assimilation [
9], they have still suffered from critical drawbacks. These include uncertainties in initial conditions and boundary settings [
10], oversimplifications of surface characteristics and their impact on outputs [
11], and structural limitations in how closely they represent real-world scenarios [
12]. Consequently, researchers have increasingly explored alternative approaches.
In response to these NWP limitations, numerous researchers have explored deep learning methods for predicting and forecasting different meteorological parameters [
13,
14,
15]. Deep learning models have emerged as potential alternatives to address the issues encountered in complex system evaluation and time series prediction [
16,
17,
18,
19], offering the ability to unravel intricate, non-linear patterns within data. Due to their capability to utilize past data for forecasting present and future states, recurrent neural networks (RNNs) have been widely applied in various studies to estimate upcoming time-step variables [
20,
21]. However, as network depth grows, RNNs frequently suffer from vanishing gradients during training. Recent studies have therefore turned to clustering-based approaches that leverage temporal correlations among related sequences to enhance forecasting robustness. Pellicani et al. [22] introduced a real-time anomaly prediction method that clusters correlated cryptocurrencies and uses an online multi-target LSTM to forecast anomalies amid market volatility. Pellicani et al. [23] proposed CARROT, which clusters cryptocurrencies by temporal correlation and uses a multi-target LSTM per cluster to forecast anomalies, helping to maximize profit or minimize losses. Qiu et al. [24] investigated a dual-clustering framework over the temporal and channel dimensions to enhance multivariate time series forecasting despite heterogeneous patterns and complex inter-channel correlations. However, as the temporal gap widens, long-term dependency challenges emerge, reducing the impact of earlier information on the present time step.
To overcome the long-term dependency problem, transformer models have been introduced as promising solutions [
25]. Transformer models, created to capture relationships across various time steps [
26,
27,
28,
29], improve the ability to forecast extended sequences. Modern advancements in deep learning have resulted in weather forecasting models that achieve predictive accuracy on par with operational systems used by national meteorological agencies [
30,
31,
32,
33]. These models are built on ERA5, the fifth-generation atmospheric reanalysis produced by ECMWF, which provides global climate information from January 1940 to the present, and they exhibit forecasting capability even when applied to initial conditions beyond those seen in their training data. Unlike other deep learning approaches that incorporate physical constraints directly into their architectures (e.g., [
34]), it remains uncertain whether these models have truly internalized atmospheric processes, such as air motion dynamics and the propagation of disturbances, or if they are merely identifying statistical patterns that minimize prediction errors for subsequent data points.
In recent years, transformer-based architectures have significantly advanced time series forecasting. Huang et al. [35] proposed a long-term ocean wave time series prediction method based on the PatchTST model. Liu et al. [36] proposed iTransformer, a simple yet effective architecture that applies attention and feed-forward networks on the inverted dimensions, treating variables as tokens for time series forecasting. Wu et al. [37] introduced TimesNet, a general time series analysis method that models temporal variations by transforming 1D time series into 2D tensors to capture intraperiod and interperiod variations. Lin and Shang [38] added multi-head attention to DLinear to forecast acoustic emission signals for early real-time crack detection in train axles, with the aim of preventing safety accidents. Despite these advances, a critical gap remains. While transformers improve long-range modeling, their self-attention mechanism incurs high computational and memory demands when handling long sequences, restricting practical deployment for extended forecasting. Moreover, as noted above, it remains uncertain whether these models have truly internalized atmospheric physics or are merely minimizing prediction errors through statistical patterns. If models could be shown to generate physically consistent solutions, they would offer a remarkable opportunity for rapid hypothesis testing.
To overcome the efficiency limitations of standard transformers, the Informer model has been proposed. The Informer introduces a ProbSparse self-attention mechanism that dramatically improves computational and memory efficiency, enabling long time-series prediction in a single forward pass via a generative-style decoder [
25]. While the standard Informer model has demonstrated promising results for long-sequence time-series forecasting [
25], its direct application to meteorological data, particularly hourly temperature prediction across diverse climate zones, remains limited. Furthermore, the original Informer architecture has three inherent limitations when applied to highly volatile climate data: (i) the ProbSparse attention mechanism, though efficient, may overlook local periodic patterns that are critical for temperature forecasting; (ii) the fixed distillation process applies uniform compression regardless of attention diversity, potentially discarding fine-grained temporal information; and (iii) the decoder lacks explicit error correction mechanisms for long-horizon predictions, leading to accumulated forecasting errors.
To address these limitations, this paper proposes the HA-Informer, which introduces three specific modifications to the original architecture. First, a hybrid attention mechanism is proposed that augments the ProbSparse attention with a parallel learnable local pattern extraction module using depthwise separable convolutions, enabling the model to capture both long-range dependencies and short-term periodic fluctuations simultaneously. Second, an adaptive distillation mechanism is developed that dynamically adjusts feature compression based on attention entropy, preserving fine-grained temporal details when attention patterns are highly dispersed. Third, a residual refinement decoder is designed that incorporates a learnable skip connection to reduce error accumulation in long-horizon forecasting. These three innovations are integrated into a unified end-to-end trainable framework, specifically tailored for hourly temperature prediction across hot, temperate, and cold climate zones. To the best of our knowledge, this work is the first to introduce entropy-guided adaptive distillation and hybrid convolution-attention mechanisms into the Informer architecture for meteorological time-series forecasting.
Consistent with the architectural innovations described above, this study pursues three major objectives, each supported by a dedicated contribution:
Simultaneously capture long-range dependencies and short-term periodic fluctuations in hourly temperature data via a hybrid attention mechanism that enhances ProbSparse attention with a parallel depthwise separable convolution branch, using a learnable mixing parameter to adaptively balance global context and local patterns.
Prevent the loss of fine-grained temporal information during sequence compression in the encoder through an adaptive distillation mechanism that dynamically adjusts the pooling ratio based on attention entropy, preserving informative details when attention is widely dispersed rather than applying uniform compression.
Mitigate error accumulation in long-horizon temperature forecasting by introducing a residual refinement decoder with a learnable skip connection, providing an auxiliary gradient pathway that directly corrects attention errors and ensures sublinear error growth relative to forecast length.
These three contributions are not independent add-ons but are jointly optimized within a unified end-to-end framework, which we denote as the HA-Informer. The proposed model was evaluated against seven competitive baselines, namely LSTM [39], a CNN [40], ARIMA [41], XGBoost [42], DLinear [43], the transformer [44], and the standard Informer [25], using hourly temperature data from three climatically diverse cities: Niamey (hot), Tehran (temperate), and Harbin (cold).
The remainder of this paper is organized as follows. Section 2 describes the deep transformer framework. Section 3 presents the proposed HA-Informer and its three key innovations. Section 4 covers the experimental setup and empirical results, including error comparisons, visual assessments, statistical tests, and computational analysis. Finally, Section 5 concludes the paper.
2. Deep Transformer Framework
The transformer model, as introduced by Vaswani et al. [
44], processes sequential data using a self-attention mechanism that enables parallel computation and captures long-range dependencies. In the following, we provide a mathematical description of the core components of the transformer, which we adopt as the foundation of our deep Informer framework. Let an input sequence be represented as a real-valued matrix
$$X \in \mathbb{R}^{L \times d_{\text{model}}},$$
where $L$ is the length of the sequence (number of tokens or time steps) and $d_{\text{model}}$ is the embedding dimension of each token. Each row vector $x_i \in \mathbb{R}^{d_{\text{model}}}$ corresponds to the initial representation of the $i$-th element in the sequence. Because the transformer architecture is permutation-invariant, meaning that without additional information it would treat the sequence as an unordered set, positional encodings are added to the input to retain information about the absolute and relative order of the sequence. The positionally encoded input is given by
$$\tilde{x}_i = x_i + p_i,$$
where $p_i \in \mathbb{R}^{d_{\text{model}}}$ is the positional vector for position index $i$. A common choice is sinusoidal encoding, which has the advantage of allowing the model to extrapolate to sequence lengths longer than those seen during training. For each dimension index $k$ satisfying $0 \le k < d_{\text{model}}/2$, we define the even and odd components of $p_i$ as
$$p_{i,2k} = \sin\!\left(\frac{i}{T^{2k/d_{\text{model}}}}\right), \qquad p_{i,2k+1} = \cos\!\left(\frac{i}{T^{2k/d_{\text{model}}}}\right), \tag{1}$$
where $T$ is a constant controlling the maximum wavelength; in the original transformer formulation, $T = 10000$. The intuition is that each dimension of the positional encoding oscillates at a different frequency, allowing the attention mechanism to easily learn to attend to relative positions.
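The sinusoidal encoding described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming an even embedding dimension; the function name `sinusoidal_positional_encoding` and the toy sizes are illustrative choices, not part of the original work.

```python
import math


def sinusoidal_positional_encoding(length, d_model, T=10000.0):
    """Return a `length` x `d_model` table of sinusoidal positional encodings.

    Even columns hold sines and odd columns hold cosines, with one
    frequency per (even, odd) pair. `d_model` is assumed to be even.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for i in range(length):
        row = [0.0] * d_model
        for k in range(d_model // 2):
            angle = i / (T ** (2 * k / d_model))
            row[2 * k] = math.sin(angle)      # even component
            row[2 * k + 1] = math.cos(angle)  # odd component
        table.append(row)
    return table


# Toy usage: 24 hourly positions, 8-dimensional embedding (illustrative sizes).
pe = sinusoidal_positional_encoding(length=24, d_model=8)
```

Position 0 maps to alternating zeros and ones, and later positions rotate each sine/cosine pair at its own frequency, which is what lets attention pick up relative offsets.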
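Given the position-encoded input, the core mechanism of the transformer is attention, which computes a weighted sum of values based on the similarity between queries and keys. Define three matrices derived from the input: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. In the simplest case of self-attention, all three come from the same input $X$, but in general they can be different. The scaled dot-product attention is defined as
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
Here, the dimensions are
$$Q \in \mathbb{R}^{L \times d_k}, \qquad K \in \mathbb{R}^{L \times d_k}, \qquad V \in \mathbb{R}^{L \times d_v},$$
where $d_k$ is the dimension of the query and key vectors, and $d_v$ is the dimension of the value vectors. The product $QK^{\top}$ yields an $L \times L$ matrix of raw attention scores, where entry $(i, j)$ is the dot product between the query at position $i$ and the key at position $j$, indicating how much position $i$ should attend to position $j$. These scores are scaled down by $\sqrt{d_k}$ to prevent excessively large values that would push the softmax function into regions of extremely small gradients. The softmax function is applied row-wise to convert the scores into a probability distribution over the $L$ positions. For any matrix $Z \in \mathbb{R}^{L \times L}$, the softmax is defined elementwise as
$$\text{softmax}(Z)_{ij} = \frac{\exp(Z_{ij})}{\sum_{l=1}^{L} \exp(Z_{il})}.$$
The resulting $L \times L$ attention weight matrix is then multiplied by $V$ to produce an $L \times d_v$ output, where each row is a weighted sum of the value vectors. To enable the model to focus on different representational subspaces simultaneously, multi-head attention is introduced. Instead of performing a single attention operation, we compute $h$ independent attention operations in parallel, each with its own learned projections. The multi-head attention output is defined as
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^{O},$$
where each attention head is given by
$$\text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}).$$
The learnable projection matrices have the following dimensions:
$$W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}, \qquad W^{O} \in \mathbb{R}^{h d_v \times d_{\text{model}}}.$$
In practice, we typically set $d_k = d_v = d_{\text{model}}/h$ to keep the computational cost constant across different numbers of heads. The output of each head $\text{head}_i$ has dimension $L \times d_v$; concatenating all $h$ heads yields an $L \times h d_v$ matrix, which is then multiplied by $W^{O}$ to project back to $\mathbb{R}^{L \times d_{\text{model}}}$.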
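To make the attention computation above concrete, the following sketch implements single-head scaled dot-product attention on plain Python lists. It is a toy illustration rather than an efficient implementation; the helper names and the small example matrices are assumptions introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention.

    Q and K are lists of d_k-dimensional rows; V is a list of
    d_v-dimensional rows with the same number of rows as K.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Raw scores: dot product of this query with every key, scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted sum of the value rows.
    output = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Toy self-attention example: queries and keys share the same rows.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Every row of `attn` is a probability distribution over the key positions, so each row of `out` is a convex combination of the rows of `V`.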
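The transformer encoder is built by stacking multiple identical layers, each composed of a multi-head attention sublayer followed by a position-wise feed-forward network (FFN). Around each sublayer, a residual connection is used, followed by layer normalization (LayerNorm). Let $X^{(\ell)} \in \mathbb{R}^{L \times d_{\text{model}}}$ be the input to an encoder layer. The first sublayer performs multi-head self-attention where the queries, keys, and values all come from the same input $X^{(\ell)}$. The intermediate output $H^{(\ell)}$ is computed as
$$H^{(\ell)} = \text{LayerNorm}\!\left(X^{(\ell)} + \text{MultiHead}(X^{(\ell)}, X^{(\ell)}, X^{(\ell)})\right).$$
Here, the addition $X^{(\ell)} + \text{MultiHead}(\cdot)$ is the residual connection, which helps with gradient flow during backpropagation. LayerNorm is applied to the sum, normalizing across the feature dimension. The second sublayer is a position-wise feed-forward network, which applies the same fully connected network independently to each position. The final output $X^{(\ell+1)}$ of the encoder layer is
$$X^{(\ell+1)} = \text{LayerNorm}\!\left(H^{(\ell)} + \text{FFN}(H^{(\ell)})\right).$$
The feed-forward network itself is a two-layer transformation with a ReLU activation in between. For a single position vector $h \in \mathbb{R}^{d_{\text{model}}}$ (a row of $H^{(\ell)}$), the FFN is defined as
$$\text{FFN}(h) = \max(0,\, hW_1 + b_1)\, W_2 + b_2,$$
where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, and $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$, $b_2 \in \mathbb{R}^{d_{\text{model}}}$ are learnable parameters. The hidden dimension $d_{ff}$ is typically larger than $d_{\text{model}}$ (e.g., $d_{ff} = 4 d_{\text{model}}$). The $\max(0, \cdot)$ operation applies the ReLU activation function elementwise.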
When a decoder is used in sequence-to-sequence tasks such as machine translation, it resembles the encoder but includes additional components. The decoder consists of three sublayers per layer: a masked multi-head self-attention sublayer, a multi-head encoder–decoder attention sublayer, and a feed-forward network. In the masked self-attention, the attention mechanism is prevented from attending to subsequent positions by applying a mask matrix $M$ to the attention scores before the softmax, where $M_{ij} = -\infty$ for $j > i$ and $M_{ij} = 0$ otherwise. This ensures that the prediction for position $i$ depends only on known outputs at positions less than $i$. The encoder–decoder attention sublayer uses queries from the previous decoder sublayer but keys and values from the encoder output, allowing the decoder to attend to all positions of the input sequence.
A fundamental limitation of the standard transformer is its computational complexity. The self-attention mechanism requires computing the $L \times L$ matrix $QK^{\top}$ and then the softmax over each of its rows. The total time complexity is
$$O(L^{2} d_{\text{model}} + L\, d_{\text{model}}^{2}).$$
Since $d_{\text{model}}$ is typically much smaller than $L$ for long sequences, the dominant term is $O(L^{2} d_{\text{model}})$. This quadratic scaling with respect to the sequence length $L$ becomes prohibitive for long sequences, such as entire books, high-resolution images treated as sequences of pixels, or long time series. This limitation motivates the development of sparse attention variants, where only a subset of the $L^{2}$ pairwise interactions are computed, or linearized attention variants, where the softmax kernel is approximated to achieve $O(L)$ complexity.
3. Hybrid Attention-Based Informer
The Informer model, introduced by Zhou et al. [
25], addresses the fundamental challenge of Long Sequence Time-series Forecasting (LSTF), where the goal is to predict an output sequence of length $L_y$ given an input sequence of length $L_x$, with $L_y$ often being very large (e.g., 720 time steps ahead). The standard transformer [44] suffers from quadratic complexity $O(L^{2})$ in the input sequence length $L$, making it prohibitive for LSTF. The Informer overcomes this through three key innovations: the ProbSparse self-attention mechanism, self-attention distillation, and a generative-style decoder. To understand these innovations, we first formalize the input and output representations.
Let an input sequence of multivariate time series be defined as
$$\mathcal{X}^{t} = \{x_1^{t}, \dots, x_{L_x}^{t} \mid x_i^{t} \in \mathbb{R}^{d_x}\},$$
where each $x_i^{t}$ is a $d_x$-dimensional observation at time step $i$ within window $t$. The target output sequence is
$$\mathcal{Y}^{t} = \{y_1^{t}, \dots, y_{L_y}^{t} \mid y_i^{t} \in \mathbb{R}^{d_y}\}.$$
Having established the sequence definitions, we now turn to the core mechanism that enables the Informer to achieve efficiency.
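The core of the Informer is the ProbSparse attention mechanism. For a standard self-attention layer with query matrix $Q \in \mathbb{R}^{L_Q \times d}$, key matrix $K \in \mathbb{R}^{L_K \times d}$, and value matrix $V \in \mathbb{R}^{L_V \times d}$, the attention output is
$$\mathcal{A}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$
This computation requires evaluating all $L_Q L_K$ query–key pairs, leading to $O(L_Q L_K)$ time complexity. The ProbSparse mechanism observes that the attention distribution often exhibits a long-tail property: only a few query–key pairs contribute significantly to the softmax output. Consequently, for a given query $q_i$ and key set $K$ of size $L_K$, the sparsity of the attention distribution is measured using the Kullback–Leibler divergence between the uniform distribution and the true attention distribution. This leads to the max-mean sparsity measure
$$\bar{M}(q_i, K) = \max_{j}\left\{\frac{q_i k_j^{\top}}{\sqrt{d}}\right\} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{d}}. \tag{2}$$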
To justify this measure theoretically, we examine the following bound.
The theoretical bound for this measure satisfies
From this bound, we see that queries with larger
$\bar{M}(q_i, K)$ values correspond to more diverse attention distributions and are more likely to capture dominant dot-product associations. Therefore, only the top $u = c \ln L_Q$ queries with the largest $\bar{M}(q_i, K)$ values are retained, forming the sparse query matrix $\bar{Q}$. The ProbSparse attention is then computed as $\mathcal{A}(Q, K, V) = \text{Softmax}\!\left(\bar{Q}K^{\top}/\sqrt{d}\right)V$. This reduction in the number of queries directly reduces the time complexity from $O(L_Q L_K)$ to $O(L_K \ln L_Q)$. However, the encoder still needs to handle long sequences efficiently, which brings us to the distillation mechanism.
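The query selection step can be illustrated with a short sketch. It scores each query by the max-mean measure and keeps the top $u = \lceil c \ln L_Q \rceil$ of them. Note the simplifying assumption: this toy version evaluates every query against every key, so it shows the selection rule but not the sampling trick the Informer uses to avoid the full quadratic cost. The function names and example values are illustrative.

```python
import math


def max_mean_sparsity(q, K):
    """Max-mean sparsity score of one query against all keys."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    return max(scores) - sum(scores) / len(scores)


def select_top_queries(Q, K, c=1.0):
    """Return the sorted indices of the top-u queries, u = ceil(c * ln L_Q)."""
    L_Q = len(Q)
    u = max(1, min(L_Q, math.ceil(c * math.log(L_Q))))
    ranked = sorted(
        range(L_Q), key=lambda i: max_mean_sparsity(Q[i], K), reverse=True
    )
    return sorted(ranked[:u])


# Toy example: queries 1 and 3 align strongly with a few keys, while
# queries 0 and 2 produce nearly uniform scores.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
Q = [[0.0, 0.0], [4.0, 0.0], [0.1, 0.1], [0.0, 3.0]]
active = select_top_queries(Q, K, c=1.0)
```

With four queries and `c = 1.0`, two queries are retained; the near-uniform ones are dropped because their maximum score barely exceeds their mean.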
The Informer encoder employs a distillation mechanism to progressively reduce the sequence length while retaining dominant features. For the $j$-th encoder layer, the distillation operation from layer $j$ to layer $j+1$ is defined as [45]
$$X_{j+1}^{t} = \text{MaxPool}\!\left(\text{ELU}\!\left(\text{Conv1d}\!\left([X_j^{t}]_{AB}\right)\right)\right),$$
where $[X_j^{t}]_{AB}$ denotes the output of the attention block at layer $j$, $\text{Conv1d}(\cdot)$ applies a one-dimensional convolution with kernel size 3, $\text{ELU}(\cdot)$ is the Exponential Linear Unit activation function defined by $\text{ELU}(x) = x$ for $x > 0$ and $\text{ELU}(x) = e^{x} - 1$ for $x \le 0$, and $\text{MaxPool}(\cdot)$ applies max-pooling with stride 2 and kernel size 2, reducing the sequence length by exactly half. Multiple distillation layers are stacked in this manner, and the feature maps from all layers are concatenated to form the final encoder output. Once the encoder has compressed the input sequence, the decoder must generate the forecast in an efficient manner.
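The convolution, ELU, and max-pooling chain can be traced on a single-channel toy sequence. The sketch below is a simplified stand-in for the tensor version: it assumes one channel, zero padding of one element on each side so the convolution preserves length, and an ELU with unit scale. All names are illustrative.

```python
import math


def conv1d_same(x, kernel):
    """Kernel-size-3 convolution with zero padding (length preserved)."""
    if len(kernel) != 3:
        raise ValueError("this sketch assumes kernel size 3")
    padded = [0.0] + list(x) + [0.0]
    return [
        sum(kernel[t] * padded[i + t] for t in range(3)) for i in range(len(x))
    ]


def elu(v, alpha=1.0):
    """Exponential Linear Unit."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)


def max_pool(x, kernel=2, stride=2):
    """Max-pooling over non-overlapping windows (halves the length)."""
    return [max(x[i:i + kernel]) for i in range(0, len(x) - kernel + 1, stride)]


def distill(x, kernel):
    """One distillation step: Conv1d -> ELU -> MaxPool."""
    return max_pool([elu(v) for v in conv1d_same(x, kernel)])


# Toy usage with an identity kernel so the effect of ELU and pooling is visible.
x = [1, -2, 3, -4, 5, -6, 7, -8]
y = distill(x, kernel=[0, 1, 0])
```

Each pooling window keeps its larger entry, so the eight-step input becomes a four-step output that retains the dominant (here, positive) activations.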
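The Informer decoder is designed for generative inference, producing the entire output sequence in a single forward pass. The decoder input is constructed as
$$X_{\text{de}}^{t} = \text{Concat}\!\left(X_{\text{token}}^{t}, X_{0}^{t}\right) \in \mathbb{R}^{(L_{\text{token}} + L_y) \times d_{\text{model}}},$$
where $X_{\text{token}}^{t}$ is the starting token sequence derived from the encoder output (typically the last $L_{\text{token}}$ time steps), and $X_{0}^{t}$ is a placeholder sequence initialized to zeros. To ensure causality during generation, a masked multi-head self-attention mechanism sets attention scores to $-\infty$ for all pairs where the target position index is less than the source position index:
$$\text{MaskedAttention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \tag{3}$$
with the mask matrix $M$ defined by $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$. The final prediction is obtained through a linear projection of the decoder output.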
Despite these innovations, the original Informer has limitations that motivate further improvement, which leads us to propose the Hybrid Attention-based Informer.
We now present the proposed Hybrid Attention-based Informer (HA-Informer). The HA-Informer retains the overall encoder-decoder architecture of the original Informer but replaces or augments three critical components with hybrid mechanisms that explicitly address the limitations of the standard Informer. The first modification introduces a hybrid attention mechanism that combines the global ProbSparse attention with a local convolutional branch. The second replaces the fixed distillation operation with an adaptive distillation mechanism that dynamically adjusts the compression ratio based on the entropy of the attention distribution. The third augments the decoder with a residual refinement network that provides an auxiliary gradient pathway and corrects attention errors. We begin with the hybrid attention mechanism.
The Hybrid Attention mechanism in the HA-Informer is defined as a convex combination of the original ProbSparse attention and a depthwise separable convolutional operation applied to the value matrix:
$$\text{HybridAttn}(Q, K, V) = \alpha \cdot \text{ProbSparseAttn}(Q, K, V) + (1 - \alpha) \cdot \text{DSConv}(V). \tag{4}$$
Here, $\alpha$ is a learnable scalar parameter initialized to 0.5, allowing the model to automatically balance between global and local feature extraction during training. The term $\text{DSConv}(V)$ denotes a depthwise separable one-dimensional convolution [46] with kernel size $k$. For an input tensor of shape $L \times d_{\text{model}}$, the depthwise separable convolution first applies a depthwise convolution, where each input channel is convolved independently with its own kernel, requiring $k\, d_{\text{model}}$ parameters, followed by a pointwise convolution (a $1 \times 1$ convolution) that mixes information across channels, requiring $d_{\text{model}}^{2}$ parameters. The total parameter complexity is $O(k\, d_{\text{model}} + d_{\text{model}}^{2})$, compared to $O(k\, d_{\text{model}}^{2})$ for a standard convolution. The hybrid attention output retains the same dimensions as the input, and the parameter $\alpha$ is optimized jointly with all other network parameters using gradient descent. Having enhanced the attention mechanism, we next address the distillation process.
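The parameter saving of the depthwise separable branch and the effect of the mixing scalar can be checked with a small sketch. Two assumptions are made here and should be read as such: biases are ignored in the parameter counts, and the mixing weight is applied to the attention branch with its complement on the convolutional branch. The function names are illustrative.

```python
def depthwise_separable_params(k, d):
    """Weights in a depthwise (k per channel) plus pointwise (d x d) conv."""
    return k * d + d * d


def standard_conv_params(k, d):
    """Weights in a standard 1D convolution with d input and d output channels."""
    return k * d * d


def hybrid_mix(attn_out, conv_out, alpha=0.5):
    """Convex combination of two equally shaped feature maps.

    Assumption: `alpha` weights the attention branch and `1 - alpha`
    weights the convolutional branch.
    """
    return [
        [alpha * a + (1.0 - alpha) * c for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(attn_out, conv_out)
    ]


# Example sizes (illustrative): kernel size 3, model dimension 512.
separable = depthwise_separable_params(k=3, d=512)
standard = standard_conv_params(k=3, d=512)
mixed = hybrid_mix([[1.0, 2.0]], [[3.0, 4.0]], alpha=0.5)
```

For these sizes the separable branch needs roughly one third of the weights of a standard convolution, and setting `alpha` to 1 reduces the mix to the attention output alone.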
The adaptive distillation mechanism in the HA-Informer replaces the fixed max-pooling operation with an entropy-guided adaptive pooling operation. For the attention block output at layer
j, let the attention probabilities for query
i be denoted by
, forming a probability distribution over the key positions. The Shannon entropy [
47] of the attention distribution is computed as
This entropy measures the dispersion of attention: low entropy indicates that attention is concentrated on a few key positions, while high entropy indicates that attention is widely dispersed across many positions. The entropy is then passed through a sigmoid activation function to produce a gating factor
Moreover, the adaptive pooling ratio is defined as a function of
, and
where
and
are constants determined by the specific pooling strategy. From [
45], the adaptive distillation operation from layer
j to layer
is then
where
applies pooling with the dynamically determined ratio
, and ⊙ denotes element-wise multiplication by the gating factor
. The product
further scales the pooled features, allowing the model to down-weight features from layers with highly dispersed attention. After improving the encoder with adaptive distillation, we turn our attention to decoder enhancements.
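The entropy and gating quantities in this paragraph can be illustrated numerically. The sketch below computes the Shannon entropy of each attention row, averages it over queries to obtain a single layer-level value (the averaging is an assumption made here for illustration), and passes it through a sigmoid. The mapping from the gate to a pooling ratio is not reproduced, since it depends on constants not specified in this section.

```python
import math


def row_entropy(p):
    """Shannon entropy (natural log) of one attention distribution."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_gate(attn_rows):
    """Return (mean entropy over queries, sigmoid gate).

    Assumption: the layer-level entropy is the mean of per-query entropies.
    """
    mean_h = sum(row_entropy(r) for r in attn_rows) / len(attn_rows)
    return mean_h, sigmoid(mean_h)


# Two toy attention maps over four key positions.
concentrated = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
dispersed = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
h_c, g_c = layer_gate(concentrated)
h_d, g_d = layer_gate(dispersed)
```

One-hot attention gives zero entropy, while uniform attention over four keys gives $\ln 4$. Because entropy is non-negative, a plain sigmoid of the raw entropy stays in $[0.5, 1)$; any centering or rescaling applied before the sigmoid would change that range.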
The Residual Refinement Decoder in the HA-Informer adds an auxiliary refinement pathway that bypasses the masked attention mechanism. The decoder output is computed as
where
is a two-layer position-wise feedforward network with hidden dimension
and ReLU activation [
48]
with
,
,
, and
. The scalar parameter
is learnable and initialized to 0.1, controlling the contribution of the refinement pathway. The term
adds directly to the masked attention output before the final linear projection, providing a residual connection that allows gradients to flow directly from the loss to the decoder input without passing through the attention mechanism. With all three components defined, we now describe how the entire HA-Informer is trained.
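The refinement pathway can be sketched as a position-wise two-layer network whose scaled output is added to the masked attention output. The code below is a minimal list-based illustration; the name `beta` for the learnable scalar, the helper names, and the tiny identity weights are assumptions chosen so the arithmetic is easy to follow.

```python
def linear(x, W, b):
    """Affine map for one position: x (d_in) times W (d_in x d_out) plus b."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))
    ]


def relu(v):
    return [max(0.0, a) for a in v]


def refine(x, W1, b1, W2, b2):
    """Two-layer position-wise feed-forward network with ReLU."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


def refined_decoder_output(attn_out, dec_in, params, beta=0.1):
    """Add the scaled refinement of the decoder input to the attention output."""
    return [
        [a + beta * r for a, r in zip(a_row, refine(x_row, *params))]
        for a_row, x_row in zip(attn_out, dec_in)
    ]


# Toy usage: identity weights and zero biases (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
params = (identity, zeros, identity, zeros)
dec_in = [[1.0, -1.0]]
attn_out = [[0.5, 0.5]]
out = refined_decoder_output(attn_out, dec_in, params, beta=0.1)
```

Because the refinement term depends only on the decoder input at the same position, it offers a path from output to input that does not pass through the attention weights; with `beta = 0` the output reduces to the attention output.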
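The HA-Informer is trained end-to-end using the mean squared error loss function
$$\mathcal{L}_{\text{MSE}} = \frac{1}{L_y}\sum_{i=1}^{L_y}\left(\hat{y}_i - y_i\right)^{2}, \tag{8}$$
where $\hat{y}_i$ is the predicted value at position $i$ and $y_i$ is the ground truth. All three additional parameters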
,
(computed from attention entropy without additional trainable parameters), and
are optimized jointly with the existing Informer parameters. Having established the HA-Informer architecture, we now demonstrate its mathematical superiority over the standard Informer, beginning with local pattern capture.
The standard Informer uses only ProbSparse attention, which selects queries based on the sparsity measure . For a time series with strong local autocorrelation at lag where , the relevant query–key pairs may not be among the top queries if the local pattern does not produce large values. In contrast, the HA-Informer’s convolutional branch explicitly models local patterns with complexity . For any local window of size k, the convolutional branch guarantees that local interactions are captured regardless of the sparsity measure, whereas the standard Informer may miss them entirely. Furthermore, the mixing parameter adapts to the data: for a purely long-range dependent process, and the HA-Informer recovers the standard Informer; for a process with strong local structure, and the model relies on convolutions. The standard Informer forces a fixed sparsity pattern that cannot adapt to the local-global trade-off. Beyond attention, the distillation process also reveals significant differences between the two models.
On the other hand, in the standard Informer, the distillation operation applies a fixed compression ratio
at every layer, meaning exactly half of the temporal positions are discarded regardless of the information content. Let the attention entropy at layer
j be
. The information loss due to pooling can be quantified as
where
is the variance of the feature map. When attention is highly dispersed (large
), many temporal positions are relevant, and discarding half of them causes significant information loss. The HA-Informer instead uses
. For large entropy
,
approaches 1, so
approaches 0; thus,
. For small entropy
(concentrated attention),
approaches 0, so
approaches 1; thus,
. This means the HA-Informer preserves more information (compresses less) when attention is concentrated, which is when the information is most valuable. More precisely, the information preservation ratio is
while the standard Informer has
always. When attention is highly concentrated (
),
, meaning only
of information is discarded compared to
in the standard Informer. This represents a
reduction in information loss for concentrated attention patterns, which are precisely the patterns that are most important for accurate forecasting. Finally, we examine the decoder behavior for long-horizon forecasting.
Let the prediction error at position
i for the standard Informer be denoted
. Because the decoder uses zero-initialized placeholders and no corrective mechanism, errors propagate autoregressively. For large
, the cumulative error satisfies
where
is the per-step error rate. The HA-Informer decoder adds the refinement term
, which provides a direct mapping from the decoder input to the output. Let the approximation error of the refinement network be bounded by
, i.e.,
. Then the HA-Informer prediction error is bounded by
where we have absorbed the effect of
into the effective
for clarity. Since
is independent of
(the refinement network operates position-wise and does not propagate errors across time), the HA-Informer error grows sublinearly in
while the standard Informer error grows linearly. For sufficiently large
, the HA-Informer achieves
This asymptotic superiority, which follows directly from the error bound analysis, is particularly pronounced for long forecasting horizons where
is large, confirming that the HA-Informer presented in Algorithm 1 outperforms the Informer in the LSTF setting.
Algorithm 1. HA-Informer algorithm
Input: input sequence $\mathcal{X}^{t}$, forecast horizon $L_y$, number of encoder layers $N$, number of decoder layers $M$, sampling constant $c$, kernel size $k$
Output: predicted sequence $\hat{\mathcal{Y}}^{t}$
Step 1: Positional Encoding
  Add positional encodings to the input using the sinusoidal functions in (1).
Step 2: Encoder Processing with Adaptive Distillation
  for $j = 1$ to $N$ do
    Compute the ProbSparse sparsity measure for each query using (2).
    Retain the top $u = c \ln L_Q$ queries with the largest sparsity measure to form the sparse query matrix.
    Compute the attention output using (4).
    Apply the residual connection and layer normalization.
    Compute the attention entropy and the gating factor.
    Compute the adaptive pooling ratio using (5).
    Apply adaptive distillation using (6).
  end for
  Concatenate the feature maps from all encoder layers.
Step 3: Decoder Input Construction
  Take the start-token sequence of length $L_{\text{token}}$.
  Initialize the placeholder sequence of length $L_y$ to zeros.
  Concatenate the start-token and placeholder sequences to form the decoder input.
Step 4: Decoder Processing with Residual Refinement
  for each of the $M$ decoder layers do
    Apply masked ProbSparse self-attention with the causal mask (0 for $j \le i$, $-\infty$ for $j > i$) using (3).
    Apply cross-attention using queries from the decoder and keys/values from the encoder.
    Compute the refinement output by (7).
    Combine the masked attention output with the residual refinement.
  end for
Step 5: Output Projection
  Apply the final linear projection to obtain the predicted sequence.
Step 6: Loss Computation (Training only)
  Compute the mean squared error via (8).
  Update the mixing parameter, the refinement scale, and all network weights via gradient descent, and return the predicted sequence.
4. Analysis and Results
4.1. Data Preparation
This study explores temperature variations and forecasting models across distinct climatic zones to provide insights into regional and global temperature trends. The selected regions for this analysis are Niamey (Niger), Tehran (Iran), and Harbin (China), chosen to represent three different climate types: hot (Niamey), temperate (Tehran), and cold (Harbin). This diversity enables a comprehensive evaluation of temperature patterns across various ecosystems, enhancing the understanding of how different climatic conditions influence forecasting models. The rationale for selecting these locations is outlined below:
- Niamey: Frequent temperature spikes may cause certain models to overfit the data.
- Harbin: Prolonged winters may introduce long-term dependencies that models such as RNNs or LSTMs can struggle to learn effectively.
- Tehran: More stable temperature variations enable models to capture trends more easily; however, this regularity may also lead to overfitting, thereby reducing generalization performance.
Specifically, Niamey, located in Niger, experiences a hot desert climate with extreme temperatures and minimal rainfall. Tehran, the capital of Iran, has a semi-arid climate characterized by hot summers and mild winters, influenced by its location between desert plains and mountain ranges. In contrast, Harbin, in northeastern China, features a cold continental climate, with long, harsh winters and short, warm summers. Each of these cities represents a distinct climatic zone, contributing to the diversity of the dataset for analyzing temperature patterns.
In this study, the HA-Informer model is applied as a sequence-to-sequence long-term time-series forecasting framework, where the hourly temperature observations from Niamey, Tehran, and Harbin are first preprocessed into normalized temporal sequences and then used as input tokens to learn long-range dependencies through the ProbSparse self-attention mechanism. The model is trained to capture both short-term fluctuations and long-term seasonal trends in temperature dynamics, and the 80/20 train–test split is strictly preserved to ensure an unbiased evaluation of forecasting performance across all three regions. The temperature data used in this study were obtained from the NASA POWER (Prediction Of Worldwide Energy Resources) Data Access Viewer, which provides publicly accessible and quality-controlled meteorological datasets derived from satellite observations and assimilated models. The data are available at:
https://power.larc.nasa.gov/data-access-viewer/, accessed on 1 February 2026.
It is acknowledged that station-level factors such as sensor placement and observational network density can influence forecasting accuracy in real-world deployments. However, the datasets used in this work are standard meteorological reanalysis/observational time series in which measurements are aggregated at the city level rather than drawn from individual heterogeneous sensor networks, and detailed sub-station metadata is not available within the scope of this study. Therefore, the focus is on evaluating model performance under consistent, publicly available time-series conditions rather than on micro-level sensor placement effects. With respect to site selection, the three cities were chosen specifically to represent diverse climatic regimes, namely, arid (Niamey), temperate continental (Tehran), and cold continental (Harbin) conditions, which allows us to test the robustness of the HA-Informer model across varying seasonal patterns and volatility levels. Coastal regions and sparse observational networks can introduce additional forecasting challenges; this study does not rely on coastal datasets, and all selected locations are inland urban meteorological series with continuous hourly coverage over the specified periods. This controlled selection helps ensure that differences in performance are primarily attributable to climatic variability rather than data sparsity or sensor distribution effects, thereby strengthening the validity of the comparative evaluation.
The dataset consists of 17,545 hourly temperature recordings for each of the three cities: Niamey, Tehran, and Harbin. The data cover a two-year period from 1 February 2024 to 1 February 2026, ensuring sufficient temporal coverage for training deep learning models. For each city, 80% of the data was used for training, and 20% was reserved for testing to evaluate model performance. Descriptive statistics of the temperature data for all three cities, namely, mean, standard deviation, minimum, maximum, and quartiles, are presented in Table 1.
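The record count and the split sizes implied by this description can be checked with a short calculation. The sketch below assumes that both endpoints (00:00 on 1 February 2024 and 00:00 on 1 February 2026) are included and that the 80/20 split is chronological; neither detail is stated explicitly, and the function name is illustrative.

```python
from datetime import datetime


def hourly_count(start: datetime, end: datetime) -> int:
    """Number of hourly timestamps from start to end, both endpoints included."""
    return int((end - start).total_seconds() // 3600) + 1


# Assumed endpoints: midnight on the first and last day of the study period.
n_records = hourly_count(datetime(2024, 2, 1), datetime(2026, 2, 1))

# Assumed chronological split: the first 80% for training, the rest for testing.
n_train = (n_records * 80) // 100
n_test = n_records - n_train

print(n_records, n_train, n_test)
```

Under these assumptions the count matches the 17,545 hourly records reported per city, since the period spans 731 days and includes 29 February 2024.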
4.2. Model Architectures
The main model employed in this research is the HA-Informer, a transformer-based architecture implemented in PyTorch version 2.1.0. After splitting the dataset, several preprocessing steps were applied, including normalization using the StandardScaler method, which scales the data to a mean of zero and a standard deviation of one. In this study, we adopted a multi-step forecasting horizon of 24 hourly steps (predicting the next 24 h). Moreover, the model’s hyperparameters were optimized by grid search over a range of configurations. The optimal configuration determined through this process is summarized in Table 2, Table 3 and Table 4.
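As an illustration of the standardization step, the following sketch reproduces the arithmetic of a standard scaler in plain Python. It assumes that the mean and standard deviation are estimated on the training portion only and then reused for the test portion, which is common practice but is not stated explicitly here; the helper names and sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train):
    """Return (mean, std) of the training values; std is the population form."""
    return fmean(train), pstdev(train)


def standardize(values, mean, std):
    """Shift and scale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [21.0, 23.5, 26.0, 24.5, 22.0]  # illustrative values, not study data
test = [25.0, 20.5]

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)  # reuses the training statistics
```

Reusing the training statistics for the test portion keeps information from the evaluation period out of the fitted scaler.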
To ensure a fair and meaningful comparison, all baseline models, namely, LSTM, a CNN, ARIMA, XGBoost, DLinear, the standard transformer, and the standard Informer, were subjected to the same hyperparameter optimization procedure as the proposed HA-Informer model. Specifically, for each baseline and each city, we performed a grid search over a predefined hyperparameter space, guided by validation set performance. For LSTM, we tuned the number of hidden units (64 to 256), the number of layers (1 to 3), the dropout rate (0.1 to 0.5), and the learning rate ($1 \times 10^{-5}$ to $1 \times 10^{-3}$). For the CNN, we optimized the kernel size (3, 5, 7), the number of filters (32 to 128), and the number of convolutional layers (1 to 3). For ARIMA, we searched over the autoregressive order p (0 to 7), the differencing order d (0 to 2), and the moving average order q (0 to 7) using the Akaike Information Criterion (AIC). For XGBoost, we searched over the number of trees (100 to 500), the maximum tree depth (3 to 10), the learning rate (0.01 to 0.3), and the subsample ratio (0.6 to 1.0). For DLinear, we tuned the kernel size for the moving average decomposition (3 to 25), the learning rate ($1 \times 10^{-5}$ to $1 \times 10^{-3}$), and the batch size (8 to 32). For the standard transformer, we tuned the number of encoder/decoder layers (2 to 4), model dimensions (128 to 512), the number of attention heads (4 to 8), feed-forward dimensions (512 to 2048), and the learning rate ($1 \times 10^{-5}$ to $1 \times 10^{-3}$). For the standard Informer, the grid covered model dimensions (256 to 512), encoder/decoder layers (2 to 3), the number of heads (7 to 8), batch size (10 to 12), and the learning rate ($3 \times 10^{-4}$ to $1 \times 10^{-3}$). All baselines were trained and evaluated on the identical 80/20 train–test splits with the same random seed to ensure reproducibility.
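The grid search procedure can be sketched as an exhaustive enumeration of configurations followed by selection on validation error. The example below is only illustrative: the grid points are assumed values inside the LSTM ranges quoted above (the exact points are not reported), and `toy_validation_error` is a hypothetical stand-in for training a model and returning its validation error.

```python
from itertools import product

# Assumed grid points inside the quoted LSTM ranges; not the study's actual grid.
lstm_grid = {
    "hidden_units": [64, 128, 256],
    "num_layers": [1, 2, 3],
    "dropout": [0.1, 0.3, 0.5],
    "learning_rate": [1e-5, 1e-4, 1e-3],
}


def iter_configs(grid):
    """Yield every combination of grid values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, validation_error):
    """Return the configuration with the lowest validation error."""
    return min(iter_configs(grid), key=validation_error)


def toy_validation_error(cfg):
    # Hypothetical stand-in for "train the model, return validation error".
    return (abs(cfg["hidden_units"] - 128)
            + abs(cfg["dropout"] - 0.3)
            + cfg["num_layers"])


configs = list(iter_configs(lstm_grid))
best = grid_search(lstm_grid, toy_validation_error)
```

With three candidate values for each of four hyperparameters, this toy grid contains 81 configurations, which shows how quickly the search cost grows with the number of tuned parameters.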
4.3. Empirical Results
The three selected cities inherently present different levels of data-related challenges. Among them, Harbin serves as an example of data limitations due to its extremely cold continental climate, where prolonged sub-freezing temperatures can introduce sensor stability concerns in raw ground observations and create highly volatile temporal patterns that challenge forecasting models. Similarly, Niamey’s hot arid climate poses risks of sensor saturation during extreme heat events, while Tehran’s semi-arid conditions offer relatively stable data characteristics. By deliberately including Harbin, a location where data limitations and climatic complexity are most pronounced, this study evaluates the HA-Informer model under challenging conditions that approximate real-world forecasting difficulties.
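As stated, Harbin exhibits the highest frequency of missing observations due to sensor stability challenges under extreme cold conditions, making it a representative case for evaluating model robustness under real-world data limitations. The linear interpolation applied here partially mitigates this issue. Let the raw hourly temperature time series for each location be denoted as
$$\{x_t\}_{t=1}^{T}.$$
To ensure temporal consistency, missing observations are first identified. For any missing value at time step $t$, linear interpolation is applied as
$$x_t = x_{t_1} + \frac{t - t_1}{t_2 - t_1}\left(x_{t_2} - x_{t_1}\right),$$
where $t_1 < t < t_2$ are the nearest previous and next observed time indices. To mitigate the influence of outliers, a local moving average over a window $\mathcal{W}_t$ of time steps around $t$ is computed
$$\bar{x}_t = \frac{1}{|\mathcal{W}_t|}\sum_{k \in \mathcal{W}_t} x_k,$$
and any observation that satisfies
$$\left|x_t - \bar{x}_t\right| > \tau$$
is considered an outlier and replaced by $\bar{x}_t$, where $\tau$ is a predefined threshold. Subsequently, min–max normalization is applied to obtain the processed sequence
$$\tilde{x}_t = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}}.$$
Following preprocessing, the normalized sequence is transformed into supervised samples using a sliding window approach. The input-output pairs are defined as
$$X_i = \left(\tilde{x}_i, \ldots, \tilde{x}_{i+L_x-1}\right) \in \mathbb{R}^{L_x \times d}, \qquad Y_i = \left(\tilde{x}_{i+L_x}, \ldots, \tilde{x}_{i+L_x+L_y-1}\right) \in \mathbb{R}^{L_y \times d},$$
where for univariate temperature data $d = 1$, and $L_x$, $L_y$ denote the input and prediction lengths, respectively. The constructed sequences $\{(X_i, Y_i)\}_{i=1}^{N}$ are fed into the HA-Informer model, which follows an encoder-decoder architecture. The encoder maps the input sequence into a latent representation
$$H_i = \mathrm{Encoder}(X_i),$$
and the decoder generates the predicted output sequence
$$\hat{Y}_i = \mathrm{Decoder}(H_i).$$
The model is trained by minimizing a prediction loss function over the training set
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell\left(\hat{Y}_i, Y_i\right),$$
where $N$ is the number of training samples and $\ell$ is the per-sample prediction loss. Thus, the overall pipeline can be expressed as a mapping
$$\{x_t\}_{t=1}^{T} \;\longrightarrow\; \{\tilde{x}_t\}_{t=1}^{T} \;\longrightarrow\; \{(X_i, Y_i)\}_{i=1}^{N} \;\longrightarrow\; \{\hat{Y}_i\}_{i=1}^{N},$$
followed by evaluation on the test set (20%) to assess generalization performance.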
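The preprocessing steps described above (gap filling, outlier replacement, scaling, and windowing) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the centred averaging window, the absolute threshold, and all function and parameter names (`half_window`, `threshold`, `input_len`, `pred_len`) are hypothetical choices, and the sample values are made up.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between observed neighbours.

    Assumes the first and last entries are observed.
    """
    filled = list(series)
    observed = [i for i, v in enumerate(series) if v is not None]
    for prev, nxt in zip(observed, observed[1:]):
        for t in range(prev + 1, nxt):
            weight = (t - prev) / (nxt - prev)
            filled[t] = series[prev] + weight * (series[nxt] - series[prev])
    return filled


def replace_outliers(series, half_window, threshold):
    """Replace points far from their local moving average by that average."""
    cleaned = list(series)
    for t, value in enumerate(series):
        lo = max(0, t - half_window)
        hi = min(len(series), t + half_window + 1)
        local_mean = sum(series[lo:hi]) / (hi - lo)
        if abs(value - local_mean) > threshold:
            cleaned[t] = local_mean
    return cleaned


def min_max_scale(series):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]


def sliding_windows(series, input_len, pred_len):
    """Return (input, target) pairs for every valid window start."""
    last_start = len(series) - input_len - pred_len
    return [
        (series[i:i + input_len],
         series[i + input_len:i + input_len + pred_len])
        for i in range(last_start + 1)
    ]


# Made-up series with two gaps and one spike (40.0).
raw = [10.0, None, 12.0, 13.0, 40.0, 15.0, 16.0, None, None, 19.0]
filled = interpolate_missing(raw)
cleaned = replace_outliers(filled, half_window=2, threshold=10.0)
scaled = min_max_scale(cleaned)
pairs = sliding_windows(scaled, input_len=4, pred_len=2)
```

In this toy run only the spike exceeds the threshold and is replaced by its local average; with the 24 h horizon used in this study, `pred_len` would be 24.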
Table 5 presents a quantitative comparison of forecasting accuracy across all models and three cities. The HA-Informer model consistently achieves the lowest error values in terms of MSE, MAE, and RMSE across Niamey, Tehran, and Harbin, followed by the standard Informer as the second-best model. For instance, in Niamey, HA-Informer achieves an MSE of 0.0006, compared to 0.0013 for Informer, 0.0039 for DLinear, and 0.0058 for LSTM, corresponding to improvements of approximately 54%, 85%, and 90%, respectively. Among baseline models, DLinear ranks third due to its effective linear decomposition, while LSTM and XGBoost show moderate performance. The CNN consistently exhibits the weakest performance across all cities (e.g., MSE of 0.0675 in Harbin), due to its limited capability in modeling long-range temporal dependencies in hourly temperature data.
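The quoted percentage improvements follow from the relative reduction in MSE with respect to each baseline. The short sketch below recomputes them from the Niamey values given above; the function name is illustrative.

```python
def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline_error - model_error) / baseline_error


ha_informer_mse = 0.0006
baseline_mse = {"Informer": 0.0013, "DLinear": 0.0039, "LSTM": 0.0058}

improvements = {
    name: relative_improvement(mse, ha_informer_mse)
    for name, mse in baseline_mse.items()
}

for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
```

Rounded to the nearest integer, these values agree with the approximate figures of 54%, 85%, and 90% reported for Informer, DLinear, and LSTM.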
The margin of improvement of HA-Informer over other models is slightly reduced in Harbin compared to Niamey and Tehran, likely due to Harbin’s more complex and highly variable cold climate patterns, which increase prediction difficulty for all models. Nevertheless, the proposed model maintains its superiority across all locations, demonstrating its robustness and effectiveness for temperature forecasting in diverse climatic conditions. These results confirm that incorporating attention-based mechanisms with hierarchical aggregation significantly enhances forecasting accuracy compared to both conventional and state-of-the-art deep learning approaches.
The visual results further support these findings. In Figure 1, the predicted temperature curves generated by the HA-Informer model closely follow the ground-truth observations across all locations, with minimal deviation even during periods of rapid temperature change. This demonstrates the model’s ability to capture both short-term fluctuations and long-term trends.
The scatter diagram presented in Figure 2 illustrates the predictive performance of the HA-Informer model for temperature forecasting on the test dataset across three diverse climatic locations: Niamey, Tehran, and Harbin. Each subplot compares the model’s predicted temperatures against the actual observed values, with points closely clustered around the diagonal line indicating high accuracy. Niamey shows an exceptionally strong fit with an $R^2$ of 0.977, suggesting that the model captures nearly all variance in temperature for this hot, semi-arid region. Tehran, with an $R^2$ of 0.955, exhibits slightly more scatter yet still demonstrates excellent predictive capability, reflecting the model’s robustness in a continental climate with wider seasonal swings. Harbin achieves an $R^2$ of 0.971, indicating a very high degree of alignment despite the challenges posed by its cold, harsh winters. Overall, the consistently high $R^2$ values across these distinct climatic zones indicate that the HA-Informer model generalizes well geographically, offering reliable temperature predictions from arid to cold continental conditions.
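For reference, the goodness-of-fit measure reported for the scatter plots is presumably the coefficient of determination, computed from observed and predicted values as one minus the ratio of residual to total sum of squares. The sketch below shows that calculation on made-up numbers; whether the study used this definition or the squared correlation coefficient is not stated.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [20.0, 22.0, 25.0, 27.0, 30.0]   # illustrative values only
predicted = [20.5, 21.5, 25.5, 26.5, 30.0]
score = r_squared(observed, predicted)
```

A value close to one means that the residual scatter around the diagonal is small compared with the overall spread of the observed temperatures.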
Figure 3 presents the box plot of prediction errors for the proposed HA-Informer model on the test set. The error distribution exhibits a narrow interquartile range (IQR), indicating stable and consistent performance across the test samples. The mean and median errors are close to zero, showing negligible systematic bias. The limited number of outliers confirms the robustness of the HA-Informer against extreme prediction errors. These results demonstrate that the proposed model not only achieves high accuracy but also maintains reliable and stable predictions across different time intervals.
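The quantities read from a box plot (median, interquartile range, and outliers) can be computed directly from the prediction errors. The sketch below assumes the conventional rule that flags points lying more than 1.5 times the IQR beyond the quartiles; the rule used for the figure is not stated, and the error values are made up.

```python
from statistics import median, quantiles


def box_plot_summary(errors):
    """Median, quartiles, IQR, and outliers under the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [e for e in errors if e < lower or e > upper]
    return {
        "median": median(errors),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers,
    }


errors = [-0.4, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 2.5]  # illustrative
summary = box_plot_summary(errors)
```

A median near zero together with a narrow IQR corresponds to the low-bias, low-spread behaviour described for the error distribution.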
Table 6 presents the results of the Diebold–Mariano (DM) test introduced in [49]. These results provide statistical evidence that the proposed HA-Informer outperforms all baseline models across the three cities: Niamey, Tehran, and Harbin. The DM statistics are negative throughout the table, indicating that the forecasting errors of HA-Informer are consistently smaller than those of the comparison models. For the Niamey dataset, HA-Informer shows substantial improvements over the original Informer and even more pronounced advantages over DLinear, LSTM, XGBoost, the transformer, the CNN, and ARIMA. The Tehran results follow a similar pattern, with HA-Informer outperforming Informer at the margin of significance while delivering highly significant improvements over all other models, particularly the CNN and ARIMA. The Harbin dataset yields the strongest results, with HA-Informer surpassing Informer and achieving exceptionally large test statistics against the CNN and ARIMA. Notably, across all three cities, the p-values for comparisons against the CNN and ARIMA are consistently far below conventional significance levels, indicating highly significant differences in predictive accuracy. The individual DM statistics and p-values for each comparison are reported in Table 6. The consistently negative DM statistics across every comparison and every city indicate that the hybrid attention mechanism, adaptive distillation, and residual refinement decoder collectively yield statistically superior forecasting performance compared to both classical time series methods and state-of-the-art deep learning architectures.
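To make the sign convention concrete, the following sketch computes a Diebold–Mariano statistic in its basic form. It assumes a squared-error loss, a loss differential defined as the loss of the first model minus the loss of the second, and a normal approximation for the p-value; the loss function and horizon used for the reported test are not stated, and the error values are made up.

```python
import math


def diebold_mariano(errors_a, errors_b, horizon=1):
    """DM statistic and two-sided p-value under squared-error loss.

    A negative statistic means model A has the smaller average loss.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(lag):
        return sum(
            (d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)
        ) / n

    # Long-run variance using autocovariances up to lag horizon - 1.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    stat = d_bar / math.sqrt(long_run_var / n)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Made-up forecast errors: model A is closer to the observations than model B.
errors_a = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.1]
errors_b = [0.5, -0.4, 0.6, 0.3, -0.5, 0.4, -0.3, 0.6]

stat, p_value = diebold_mariano(errors_a, errors_b)
stat_swapped, _ = diebold_mariano(errors_b, errors_a)
```

Swapping the two models only flips the sign of the statistic. For longer horizons the truncated variance estimate used here can become non-positive, in which case a weighted estimator is usually preferred.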
Table 7 presents a computational cost comparison of eight forecasting models applied to the Tehran region, reporting both training time (in minutes) and inference time (in milliseconds per batch at a fixed batch size). Among all models, ARIMA is the fastest to train by a substantial margin, requiring only 0.91234 min, although its inference time of 1.52381 ms/batch is only moderate. In contrast, XGBoost offers the most efficient inference at just 0.39155 ms/batch while maintaining a low training cost of 4.52001 min, making it highly suitable for real-time forecasting applications. DLinear and the CNN also demonstrate excellent inference efficiency with 0.50762 ms/batch and 0.61347 ms/batch, respectively, alongside moderate training times of 8.24036 and 12.31158 min. LSTM requires 18.50021 min for training and achieves an inference time of 0.86422 ms/batch, placing it in the mid-range for both metrics. The attention-based models (transformer, Informer, and HA-Informer) are the most computationally expensive, with training times of 42.70316, 28.42019, and 35.83309 min, respectively, and inference times of 2.84512, 2.10344, and 2.27365 ms/batch, all considerably higher than those of the tree-based and lightweight deep learning alternatives. Overall, the results indicate that for forecasting tasks in the Tehran region, XGBoost, DLinear, and the CNN provide the best balance of low training cost and fast inference, while transformer-based models are considerably more resource-intensive without offering inference-time advantages. Therefore, a key limitation of this study is that, although HA-Informer achieved high accuracy, its computational cost exceeds that of simpler models and even that of the standard Informer.
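Per-batch inference time of the kind reported in Table 7 is typically obtained by timing repeated forward passes and averaging. The sketch below shows one simple way to do this with a wall-clock timer; it is not the measurement protocol of this study, and `toy_predict`, the batch shape, and the warm-up count are hypothetical.

```python
import time


def mean_inference_ms(predict, batches, warmup=1):
    """Average wall-clock time per batch, in milliseconds."""
    for batch in batches[:warmup]:
        predict(batch)  # warm-up calls are not timed
    start = time.perf_counter()
    for batch in batches:
        predict(batch)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(batches)


def toy_predict(batch):
    # Hypothetical stand-in for a trained model's forward pass.
    return [sum(window) / len(window) for window in batch]


# 20 batches, each with 32 input windows of 96 values (assumed sizes).
batches = [[[0.1 * k for k in range(96)] for _ in range(32)] for _ in range(20)]
ms_per_batch = mean_inference_ms(toy_predict, batches)
```

For models running on a GPU, the device would also need to be synchronized before each clock reading, otherwise asynchronous execution makes the measured times too short.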