Comparative Analysis of Attention Mechanisms in Densely Connected Network for Network Traffic Prediction

Oh, Myeongjun; Oh, Sung; Im, Jongkyung; Kim, Myungho; Kim, Joung-Sik; Park, Ji-Yeon; Yi, Na-Rae; Bae, Sung-Ho

doi:10.3390/signals6020029

by Myeongjun Oh ¹, Sung Oh ¹, Jongkyung Im ¹, Myungho Kim ², Joung-Sik Kim ², Ji-Yeon Park ², Na-Rae Yi ² and Sung-Ho Bae ¹,*

¹ Department of Artificial Intelligence, Kyunghee University, 1732, Deogyeong-daero, Giheung-gu, Yongin-si 17104, Republic of Korea

² Hanwha, 188 Pangyoyeok-ro, Bundang-gu, Seongnam-si 13524, Republic of Korea

* Author to whom correspondence should be addressed.

Signals 2025, 6(2), 29; https://doi.org/10.3390/signals6020029

Submission received: 28 April 2025 / Revised: 9 June 2025 / Accepted: 11 June 2025 / Published: 19 June 2025

Abstract

Recently, STDenseNet (SpatioTemporal Densely connected convolutional Network) showed remarkable performance in predicting network traffic by leveraging the inductive bias of convolution layers. However, it is known that such convolution layers can only barely capture long-term spatial and temporal dependencies. To solve this problem, we propose Attention-DenseNet (ADNet), which effectively incorporates an attention module into STDenseNet to learn representations for long-term spatio-temporal patterns. Specifically, we explored the optimal positions and the types of attention modules in combination with STDenseNet. Our key findings are as follows: i) attention modules are very effective when positioned between the last dense module and the final feature fusion module, meaning that the attention module plays a key role in aggregating low-level local features with long-term dependency. Hence, the final feature fusion module can easily exploit both global and local information; ii) the best attention module is different depending on the spatio-temporal characteristics of the dataset. To verify the effectiveness of the proposed ADNet, we performed experiments on the Telecom Italia dataset, a well-known benchmark dataset for network traffic prediction. The experimental results show that, compared to STDenseNet, our ADNet improved RMSE performance by 3.72%, 2.84%, and 5.87% in call service (Call), short message service (SMS), and Internet access (Internet) sub-datasets, respectively.

Keywords: spatiotemporal data prediction; DenseNet; attention mechanism

1. Introduction

With the rapid growth of mobile data consumption driven by smartphones and 5G networks, network congestion, communication delays, and connection failures often occur in areas with excessive traffic demand [1]. These issues may cause minor inconveniences on normal days. If numerous base stations are deployed to improve the traffic capacity, energy consumption will increase. Therefore, making accurate traffic predictions is important to improve the Quality of Service and enable efficient energy consumption through dynamic resource allocation [2,3,4,5]. Moreover, in critical situations such as wartime, where rapid situational awareness and response are important, delays or disconnections in receiving and transmitting attack warnings can lead to severe consequences, even if the issues occur in a short period. To mitigate these issues, it is crucial to hand over devices to alternative base stations or communication stations seamlessly to ensure stable network communication, even during periods of congestion. For this reason, accurate mobile traffic prediction is important for optimizing network performance and enhancing the user experience [6].

Since mobile traffic data is time series data, traditional time series prediction methods like linear regression [7,8] and support vector machines (SVMs) [9,10,11] try to achieve accurate predictions. However, mobile traffic data has spatial and temporal characteristics. So, it is necessary to consider both spatial and temporal characteristics and their relationship for accurate data prediction. To effectively capture these complicated relationships, more sophisticated modeling techniques are required. In this context, deep learning approaches such as stacked autoencoders and Long Short-Term Memory (LSTM) networks [12,13] have emerged as promising solutions for traffic prediction. However, these methods suffer from information loss during encoding and struggle to capture spatial dependencies. Recently, many studies [14,15,16,17,18,19,20,21] have solved these challenges by applying a convolutional network block structure. This structure enables the model to capture local and global spatial dependencies through convolution layers. And some of them [14,15] use the DenseNet [22] structure, which is free from the gradient-vanishing problem because of its dense connection between layers. Also, they receive inputs separately as hour-based and date-based data, enabling more effective temporal trend prediction. HSTNet [15] further enhances this approach by incorporating deformable convolutions [23] in dense blocks to capture long-range spatial dependencies and introducing an attention module to compute attention scores for input data to weigh data in crucial locations. However, HSTNet calculates an attention map only from inputs in an attention branch and applies directly to prediction, so the branch cannot deal with various intermediate features. Due to this, the model has limitations in reflecting fine-grained details. 
In addition, a model trained with deformable convolution, which can be regarded as a form of attention because it applies convolution to offset-adjusted pixels, may not always be optimal. Because the offsets in deformable convolution are learned through gradients, reaching pixels that require large offsets is difficult. As a result, some important pixels may never be referenced during training, even when they carry more valuable information than their neighboring pixels.

For this reason, we propose Attention-DenseNet (ADNet), a deep learning model that integrates an attention module into a DenseNet-based framework. We explore optimal strategies for utilizing attention mechanisms to improve traffic prediction performance, examining both the impact of attention module placement and the most effective types of attention mechanisms for different tasks. Our experiments confirm that attention modules are very effective when positioned between the last dense module and the final feature fusion module, meaning that the attention module plays a key role in aggregating low-level local features with long-term dependency. Hence, the final feature fusion module can easily exploit both global and local information. In addition, the best attention module differs depending on the spatiotemporal characteristics of the dataset. Through extensive experiments, we validate that choosing dataset-specific attention mechanisms and configurations leads to optimal performance. The experimental results show that, compared to STDenseNet, our ADNet improved RMSE performance by 3.72%, 2.84%, and 5.87% in call service (Call), short message service (SMS), and Internet access (Internet) sub-datasets, respectively.

Our contributions are as follows:

We propose ADNet, based on STDenseNet, which is designed through searches for optimal types and positions of attention modules.
We conduct comprehensive experiments and several ablation experiments to validate effectiveness. By applying our method, we can achieve remarkable prediction performance compared with existing methods.

The remainder of this paper is organized as follows. Section 2 reviews related work about network traffic prediction using convolutional networks and attention mechanisms. Section 3 analyzes the dataset used in this paper. Section 4 describes the architecture of the proposed method, ADNet. Section 5 presents the experimental settings and results. Section 6 concludes the paper and discusses potential directions for future work.

2. Related Works

2.1. Convolutional Network for Network Traffic Prediction

STDenseNet [14] is a model designed to predict spatiotemporal patterns by leveraging DenseNet [22]. Through the convolution layers in DenseNet, the model captures local spatial features and makes it possible to capture wider spatial features as the model goes deeper. To capture temporal characteristics well with CNN, STDenseNet applies the DenseNet structures separately to two key components: closeness and period. Closeness captures short-term fluctuations in data volume, and period identifies periodic patterns in data fluctuations based on daily trends. It allows the model to make predictions based on the current data trends and preserve past fluctuation patterns simultaneously. The predictions from these components are then combined using parametric fusion, where each output is assigned a learnable weight before final aggregation.

HSTNet [15] extends STDenseNet by incorporating various mechanisms to enhance feature extraction ability. First, it uses external factors such as holiday information as additional inputs, thereby improving its ability to model complex temporal variations. Furthermore, it introduces deformable convolution within the dense blocks, enabling the model to adjust its receptive field dynamically and thus capture more complex spatial dependencies effectively. Moreover, HSTNet introduces an attention branch, which modulates the outputs of both components by applying higher weights to pixels with higher traffic density in the input data. In short, HSTNet improves the ability of STDenseNet to capture spatiotemporal features by applying deformable convolution as an attention mechanism, and it shows that the attention mechanism itself is helpful for accurate prediction.

However, the attention branch in HSTNet obtains a spatial map only from inputs and directly uses it for prediction. This design limits its ability to reflect spatio-temporal features from the hidden layers. In addition, deformable convolution, which can be regarded as a form of attention because it computes convolution only over offset-adjusted pixels, may lead to suboptimal performance. Because its offsets are updated via gradient descent, it is hard to reach pixels that require extremely large offsets, so important information may be overlooked even when those pixels are more strongly correlated than their neighboring pixels. Moreover, few studies have examined the position and type of attention. Therefore, in this paper, we explore the optimal position and search for the optimal attention type for network traffic prediction.

Recent studies no longer rely solely on CNNs and instead focus on hybrid architectures that combine CNNs with other neural network architectures. For example, Ref. [24] combines a transformer structure with residual convolution networks, and Ref. [21] combines 3D convolution layers with ConvLSTM. We adopted the CNN structure due to its fast inference speed and aim to improve performance by incorporating attention modules.

2.2. Attention Mechanism

The attention mechanism emerged in natural language processing tasks [25] and has shown outstanding performance [26]. As the attention mechanism proved to be highly effective, it began to be widely used in various vision tasks [27,28]. In computer vision, the attention mechanism imitates human perception [29,30,31], amplifying important information while suppressing irrelevant details for inference. Early attention mechanisms combine normalized feature maps that represent saliency maps [31] or use reinforcement learning with tiny patches so that the model can predict using only some of the patches, like human perception [29].

Residual channel attention [30] computes an attention map (soft mask) from the input by downsampling and upsampling, and the model multiplies it with a feature map obtained from the residual convolution branch. Inspired by residual channel attention, several attention mechanisms have been developed. Spatial attention [32] focuses on emphasizing important features based on their spatial locations. By leveraging the spatial relationships within the input feature map, this mechanism induces the model to focus on specific regions, enabling it to process key features selectively. Channel attention [28,32] prioritizes the relationships between feature map channels, enhancing the importance of significant channels while reducing the influence of less relevant ones. This approach helps the model learn more meaningful feature representations.

Triplet attention [33] is designed to capture both spatial and channel dependencies while maintaining computational efficiency. It independently applies attention mechanisms along two of the three axes—height, width, and channel—and then integrates these to produce a more refined feature representation. Non-local attention [27] considers relationships among all elements within the input data, learning how each element interacts with every other element. This mechanism is inspired by self-attention [26], which has achieved significant success in natural language processing, particularly through transformer models, and has recently been adapted for computer vision tasks, demonstrating strong performance.

Many studies [24,34,35,36,37,38,39,40] use attention mechanisms in their traffic prediction models, but there is currently no consensus on which type of attention is most effective or where it should be placed within the network architecture. For example, some studies use self-attention, but they differ in how they handle spatiotemporal data [24,34,36], external data [37], or embedded features [35,40]. Other studies use squeeze-and-excitation blocks [39] or channel attention and spatial attention [38]. In this work, we explore various attention modules and evaluate their effectiveness when applied to different positions in the ADNet architecture. Since different attention mechanisms compute attention weights based on distinct criteria, we conduct experiments utilizing various attention modules to analyze their effectiveness in network traffic prediction.

3. Data Analysis

Mobile traffic data consists of time-series records of communication volume across different media in specific regions. We first analyze the Telecom Italia dataset [41,42], an open dataset that has been widely used in several studies [14,15,34,43,44,45], which makes it a representative benchmark. It consists of multiple traffic sub-datasets (Call, Internet, and SMS), so we can train the model to predict various cases. The data were collected in Milan for 2 months, from 1 November 2013 to 1 January 2014, at a recording interval of 10 min. Because some records were missing, we aggregated every six records and used the resulting hourly data for training and analysis. The left of Figure 1 shows the number of Call Detail Records (CDRs) regarding SMS, Call, and Internet. In the legends of the graphs, sms-in denotes the number of received SMS, sms-out denotes the number of sent SMS, call-in denotes the number of received calls, call-out denotes the number of issued calls, and internet denotes the number of CDRs that are generated when the connection lasts for more than 15 min or the user transfers more than 5 MB during a given time interval. The right of Figure 1 shows the Internet traffic for a random hour in Milan divided into a 100 × 100 grid, where each cell of the grid has a size of 235 × 235 m². As shown on the left of Figure 1, all of the traffic data has a periodicity of approximately 24 h. In addition, in the SMS and Call graphs, the first 3 days have lower peaks than the other days, which means that the data has uneven and semi-periodic temporal characteristics. As shown on the right of Figure 1, even within a time slot that exhibits high traffic, the traffic in one cell can differ significantly from that in other cells, which means that the data also has uneven spatial characteristics. Therefore, predicting traffic requires considering both these complicated temporal patterns and spatial dependencies.
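The hourly aggregation described above can be illustrated with a minimal NumPy sketch. It assumes that aggregation means summing six consecutive 10-minute frames and that missing records are already stored as zeros; the function name `aggregate_to_hourly` and the toy array are hypothetical and not part of the original pipeline.

```python
import numpy as np

def aggregate_to_hourly(traffic_10min: np.ndarray) -> np.ndarray:
    """Sum every six consecutive 10-minute frames into one hourly frame.

    traffic_10min has shape (T, H, W). Trailing frames that do not fill a
    complete hour are dropped (an assumption of this sketch).
    """
    steps_per_hour = 6
    n_hours = traffic_10min.shape[0] // steps_per_hour
    trimmed = traffic_10min[: n_hours * steps_per_hour]
    grouped = trimmed.reshape(n_hours, steps_per_hour, *trimmed.shape[1:])
    return grouped.sum(axis=1)

# Toy example: 12 ten-minute frames on a 2 x 2 grid, every cell equal to 1.
hourly = aggregate_to_hourly(np.ones((12, 2, 2)))
```

In this toy case, each hourly cell holds the sum of six 10-minute values.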

4. Method

4.1. Overview

Figure 2 shows the framework of ADNet. We can divide the structure of our method into three stages: (1) feature extraction, (2) the attention layer, and (3) parametric fusion. In the feature extraction stage, the model extracts both temporal and periodic features from the input. In the attention layer, it computes attention scores based on the extracted features and applies these scores as weights. Finally, in the parametric fusion stage, the model combines the temporal and periodic features by weighting them with learnable parameters and summing them, allowing it to prioritize either the temporal or periodic components as needed.

4.2. Feature Extraction in Closeness and Period

To consider both spatial and temporal characteristics, ADNet uses dense blocks (Figure 3b), whose convolution kernels capture spatial dependencies. The model takes two types of inputs, each fed to its own stack of dense blocks: one sampled in units of p time steps, and the other sampled in units of q days. This design enables each stack to specialize in one of two types of predictions: closeness prediction, which captures short-term temporal trends, and periodic prediction, which models long-term periodic patterns based on date. Additionally, by utilizing dense blocks composed of convolution layers, batch normalization, and ReLU activation functions, the model can leverage deeper network architectures. If we define $X_{C}^{0}$ as the initial input and $f_{L}(\cdot)$ as the L-th dense block, the output of the L-th dense block in closeness, $X_{C}^{L}$, can be expressed in terms of the block's input as

$$X_{C}^{L} = f_{L}(X_{C}^{0} \oplus X_{C}^{1} \oplus \dots \oplus X_{C}^{L-1}), \tag{1}$$

where ⊕ denotes concatenation.

Similarly, the output of the L-th dense block in period, $X_{P}^{L}$, can be expressed in terms of the block's input as

$$X_{P}^{L} = f_{L}(X_{P}^{0} \oplus X_{P}^{1} \oplus \dots \oplus X_{P}^{L-1}). \tag{2}$$

The first convolutional block (Figure 3a) and the last convolutional block (Figure 3c) are not densely connected to other dense blocks. The first convolutional block connects the input to feature extraction, and the last one connects feature extraction to the attention module.
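The dense connectivity in Equations (1) and (2) can be sketched as follows. This is only an illustration: `toy_block` is a hypothetical stand-in for the real convolution, batch normalization, and ReLU block, and the tensor layout (channels, height, width) is an assumption.

```python
import numpy as np

def dense_forward(x0, blocks):
    """Feed each block the channel-wise concatenation of all earlier outputs."""
    outputs = [x0]
    seen_channels = []  # number of input channels each block receives
    for block in blocks:
        stacked = np.concatenate(outputs, axis=0)  # the ⊕ in Eq. (1)
        seen_channels.append(stacked.shape[0])
        outputs.append(block(stacked))
    return outputs[-1], seen_channels

def toy_block(x):
    # Hypothetical stand-in for conv + BN + ReLU: average the channels,
    # apply ReLU, and emit a single output channel.
    return np.maximum(x.mean(axis=0, keepdims=True), 0.0)

x0 = np.ones((2, 4, 4))  # (channels, height, width)
final, seen_channels = dense_forward(x0, [toy_block, toy_block, toy_block])
```

Because every block emits one channel here, the input width grows by one channel per block, which is the growth pattern that dense connections produce.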

4.3. Attention Module

ADNet weighs important spatiotemporal features by incorporating an attention module. This module is integrated into the existing STDenseNet framework and is positioned after the blocks that capture temporal and periodic characteristics. By doing so, it strengthens the model’s ability to capture relationships related to temporal proximity and periodicity [15]. The attention module guides the network to focus on essential features by considering interactions between features across both temporal aspects. As a result, the model can better capture complex spatiotemporal dependencies in both temporal proximity and periodic patterns, which improves the accuracy of mobile traffic prediction.

We use several attention mechanisms in our method, including spatial attention, i.e., $M_{S}(F)$ in Equation (3); channel attention, i.e., $M_{C}(F)$ in Equation (4); triplet attention, i.e., $M_{T}(F)$ in Equation (6); and self-attention, $M_{Soft}(F)$, as shown in Equation (7). Additionally, Ref. [46] has shown that replacing the softmax function in non-local attention with a Gaussian kernel can improve performance. Based on this finding, we also use this modified attention mechanism (Equation (8)). Furthermore, we explore several attention module combinations, such as CBAM [32] and a modified triplet attention in which the spatial attention is replaced with self-attention, to create additional variations. We express the operations of the aforementioned attention modules in the following way:

Spatial attention:

$$M_{S}(F) = \sigma(f([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])), \tag{3}$$

where $f(\cdot)$ denotes a convolution layer, and $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ denote an average pooling layer and a max pooling layer, respectively.
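A minimal NumPy sketch of Equation (3) is given below. It assumes that pooling is taken over the channel axis (as in CBAM [32]), that the convolution uses zero padding to preserve the spatial size, and that it has no bias term; the function name and the naive loop-based convolution are illustrative choices, not the authors' implementation.

```python
import numpy as np

def spatial_attention(F, kernel):
    """Eq. (3): F has shape (C, H, W); kernel has shape (2, k, k) with odd k.

    Returns an (H, W) map of weights in [0, 1].
    """
    # Channel-wise average and max pooling, stacked into 2 channels.
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    H, W = F.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Cross-correlation over the k x k window, as in deep-learning "conv".
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return 1.0 / (1.0 + np.exp(-out))  # sigmoid

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 6, 6))
attn_map = spatial_attention(F, rng.normal(size=(2, 5, 5)))  # kernel size 5
reweighted = F * attn_map  # broadcast the map over all channels
```

The kernel sizes 3, 5, and 7 compared in the experiments correspond to different values of `k` here.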

Channel attention:

$$M_{C}(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))), \tag{4}$$

$$\mathrm{MLP}(x) = W_{1}\,\mathrm{ReLU}(W_{0} x + b_{0}) + b_{1}, \tag{5}$$

where $\sigma(\cdot)$ denotes a sigmoid activation function in Equation (4), and $\mathrm{ReLU}(\cdot)$ denotes a Rectified Linear Unit activation function in Equation (5).
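Equations (4) and (5) can be sketched in NumPy as follows. The sketch assumes that pooling is taken over the spatial axes so that each channel yields one descriptor, and that the same MLP weights are shared by both pooled descriptors; the hidden width and the random weights are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, b0, W1, b1):
    """Eq. (4)-(5): F is (C, H, W); W0 is (hidden, C); W1 is (C, hidden).

    Returns one weight per channel, shape (C,).
    """
    avg = F.mean(axis=(1, 2))  # spatial average pooling -> (C,)
    mx = F.max(axis=(1, 2))    # spatial max pooling -> (C,)

    def mlp(x):
        # Shared two-layer perceptron from Eq. (5).
        return W1 @ np.maximum(W0 @ x + b0, 0.0) + b1

    return sigmoid(mlp(avg) + mlp(mx))

rng = np.random.default_rng(0)
C, hidden = 8, 2
F = rng.normal(size=(C, 5, 5))
weights = channel_attention(
    F,
    rng.normal(size=(hidden, C)), np.zeros(hidden),
    rng.normal(size=(C, hidden)), np.zeros(C),
)
reweighted = F * weights[:, None, None]  # scale each channel by its weight
```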

Triplet attention:

$$M_{T}(F) = \frac{1}{3}\left(\overline{F_{1} \cdot M_{S}(F_{1})} + \overline{F_{2} \cdot M_{S}(F_{2})} + F \cdot M_{S}(F)\right), \tag{6}$$

where $F_{1}$ and $F_{2}$ denote the feature maps permuted from $(C \times H \times W)$ to $(W \times H \times C)$ and $(H \times C \times W)$, respectively, and the overlines in $\overline{F_{1} \cdot M_{S}(F_{1})}$ and $\overline{F_{2} \cdot M_{S}(F_{2})}$ denote permuting the tensor back to the original shape, $(C \times H \times W)$, in Equation (6).
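The permutation logic of Equation (6) can be sketched as below. The helper `spatial_gate` is a simplified version of the spatial attention in Equation (3), with pooling over the leading axis, zero padding, and no bias; giving each branch its own kernel is an assumption of this sketch.

```python
import numpy as np

def spatial_gate(F, kernel):
    """Pool over the leading axis, convolve (2, k, k) kernel, apply sigmoid."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    A, B = F.shape[1:]
    out = np.empty((A, B))
    for i in range(A):
        for j in range(B):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return 1.0 / (1.0 + np.exp(-out))

def triplet_attention(F, k1, k2, k3):
    """Eq. (6): average of three gated branches; F has shape (C, H, W)."""
    F1 = F.transpose(2, 1, 0)  # (W, H, C)
    F2 = F.transpose(1, 0, 2)  # (H, C, W)
    # Gate each permuted tensor, then permute back to (C, H, W) (the overline).
    branch1 = (F1 * spatial_gate(F1, k1)).transpose(2, 1, 0)
    branch2 = (F2 * spatial_gate(F2, k2)).transpose(1, 0, 2)
    branch3 = F * spatial_gate(F, k3)
    return (branch1 + branch2 + branch3) / 3.0

rng = np.random.default_rng(0)
F = rng.normal(size=(3, 4, 5))
kernels = [rng.normal(size=(2, 3, 3)) for _ in range(3)]
out = triplet_attention(F, *kernels)
```

With all-zero kernels every gate equals 0.5, so the output reduces to half the input, which is a convenient sanity check for the permutations.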

Non-local softmax attention:

$$M_{Soft}(F) = \mathrm{softmax}\left(\frac{F^{T} W_{Q}^{T} W_{K} F}{\sqrt{d_{k}}}\right) W_{V} F, \tag{7}$$

Non-local Gaussian attention:

$$M_{Gauss}(F) = \exp\left(-\frac{\| W_{Q} F - W_{K} F \|^{2}}{2 \sqrt{d_{k}}}\right) W_{V} F, \tag{8}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ denote the weights of query, key, and value, and $d_{k}$ denotes the dimension of the key tensor ($W_{K} F$) in Equations (7) and (8). $\exp(\cdot)$ denotes an exponential function in Equation (8).
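The two non-local variants can be compared with a small NumPy sketch. Several details here are assumptions rather than statements from the text: the feature map is flattened to shape (C, N) with N = H × W, the softmax is taken over the key positions, the squared distance in the Gaussian variant is computed pairwise between query and key positions, and the exponent carries the negative sign of a standard Gaussian kernel.

```python
import numpy as np

def nonlocal_attention(F, W_q, W_k, W_v, kind="softmax"):
    """F: (C, N) flattened feature map. Returns (output (N, d), weights (N, N))."""
    Q, K, V = (W_q @ F).T, (W_k @ F).T, (W_v @ F).T  # each (N, d)
    d_k = K.shape[1]
    if kind == "softmax":
        scores = Q @ K.T / np.sqrt(d_k)
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)
    else:
        # Pairwise squared distances between every query and every key.
        sq_dist = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
        weights = np.exp(-sq_dist / (2.0 * np.sqrt(d_k)))
    return weights @ V, weights

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))        # C = 4 channels, N = 6 positions
W = rng.normal(size=(2, 4))        # shared toy projection, d = 2
out_soft, w_soft = nonlocal_attention(F, W, W, W, kind="softmax")
out_gauss, w_gauss = nonlocal_attention(F, W, W, W, kind="gaussian")
```

When the query and key projections coincide, as in this toy call, each position has distance zero to itself, so the diagonal of the Gaussian weight matrix is exactly one.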

4.4. Parametric Matrix Based Fusion

ADNet combines the predictions of closeness and periodicity by first weighting them with learnable parameters using element-wise multiplication, then summing the results to obtain the final prediction. This output is subsequently normalized to a range between 0 and 1 using the sigmoid function. This approach allows the model to adaptively balance the importance of closeness and periodicity, which may vary depending on time and location, thereby improving prediction accuracy. The features extracted through L dense blocks are reduced to match the output dimension. The closeness and periodic predictions, $X_{C}^{L+1}$ and $X_{P}^{L+1}$, respectively, pass through a total of $L + 1$ convolution layers and are weighed using the learnable parameters $W_{C}$ and $W_{P}$. The final prediction is computed as follows:

$$\hat{X}_{t} = \sigma(W_{C} \odot X_{C}^{L+1} + W_{P} \odot X_{P}^{L+1}), \tag{9}$$

where ⊙ denotes the element-wise multiplication, and $\sigma$ represents a sigmoid function. The final predicted value $\hat{X}_{t}$ serves as the model’s output at time t.
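Equation (9) is simple enough to write directly. In the sketch below, the weight matrices are plain arrays with the same shape as the predictions; in a trained model they would be learnable parameters, and the toy values are hypothetical.

```python
import numpy as np

def parametric_fusion(x_c, x_p, w_c, w_p):
    """Eq. (9): element-wise weighted sum of both branches, then sigmoid."""
    fused = w_c * x_c + w_p * x_p
    return 1.0 / (1.0 + np.exp(-fused))

rng = np.random.default_rng(0)
x_c = rng.normal(size=(20, 20))  # closeness prediction on a 20 x 20 grid
x_p = rng.normal(size=(20, 20))  # periodic prediction on the same grid
# Per-cell weights let each location favor closeness or period.
prediction = parametric_fusion(x_c, x_p, np.full((20, 20), 0.7), np.full((20, 20), 0.3))
```

Since the sigmoid bounds the output between 0 and 1, the prediction lives on the same scale as Min-Max-normalized targets.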

5. Experiment

5.1. Experiment Setting

Table 1 provides an overview of the experimental settings, including the configurations of the training and testing environments, the dataset, and the model hyperparameters. Further details are provided in the subsequent sections.

5.1.1. Dataset Preprocessing

In our experiments, we used the Telecom Italia dataset [41], which contains Call, Internet, and SMS data collected in Milan from 1 November 2013 to 1 January 2014. Following [14], we aggregated the data from 10-minute intervals into 1-hour increments. We designated the last 7 days of data as the test set and used the remaining data for training. Additionally, we applied Min-Max normalization to scale the values between 0 and 1, ensuring smoother model training.
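The split and normalization steps can be sketched as follows. The text does not state which subset the Min-Max statistics are computed on; this sketch assumes they come from the training portion only, and the function name and toy series are hypothetical.

```python
import numpy as np

def split_and_scale(hourly, test_days=7):
    """Hold out the last `test_days` days and Min-Max scale with train statistics.

    hourly has shape (T, H, W) with one frame per hour.
    """
    test_len = test_days * 24
    train, test = hourly[:-test_len], hourly[-test_len:]
    lo, hi = train.min(), train.max()  # assumption: statistics from training data only
    scale = lambda x: (x - lo) / (hi - lo)
    return scale(train), scale(test), (lo, hi)

# Toy series: 10 days of hourly frames on a 1 x 1 grid with increasing values.
series = np.arange(240, dtype=float).reshape(240, 1, 1)
train_s, test_s, (lo, hi) = split_and_scale(series)
```

Under this assumption, test values can fall slightly outside the range between 0 and 1 when the held-out week contains values not seen during training.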

Originally, the Telecom Italia dataset is structured as a 100 × 100 grid of regions. However, previous methods have used different optimal cell sizes for training and predicting network traffic. For example, STDenseNet [14] uses a 20 × 20 grid size, whereas HSTNet [15] uses a 100 × 100 grid data. Therefore, to ensure a fair comparison, we conducted experiments with both 20 × 20 and 100 × 100 grid sizes and evaluated the prediction performance accordingly. Also, to ensure consistency with the experimental settings of [15], we added the in-channel and out-channel to form a single channel in 100 × 100 grid experiments, unlike the 20 × 20 grid experiment.

Additionally, we used the dataset collected in Trentino [41,47]. While the Milan dataset records data in a city area, the Trentino dataset covers a broader area, so its data distribution is sparser than that of the Milan dataset. By using it, we can evaluate the robustness of our method in environments with limited data.

5.1.2. Experimental Environment

We used the same hyperparameters as those provided by the authors in [14,15] to ensure fair comparisons. The model is trained for 100 epochs using mini-batches of size 32. The initial learning rate is set to 0.01 and reduced by a factor of 0.1 at the 50th and 75th epochs. We used the Adam optimizer [48] for training. Each convolution layer has 32 channels. We set both p and q to 3 for the model inputs, which means that the model takes the hourly data from the past 3 h and the daily data from the past 3 days to forecast traffic [14]. To reduce memory usage, we downscaled the key, query, and value tensors in non-local attention by a factor of 4 when using the softmax function and by a factor of 2 when using the Gaussian kernel. We compared against three models: STDenseNet as the baseline, HSTNet, and the Multi-view Spatial-Temporal Graph Network (MVSTGN). We set the values of the cross-domain dataset to 0 to ensure a fair comparison. We incorporated attention modules in the attention layer described in Section 4.3. Experiments were conducted using an RTX 3090 GPU, and the code was implemented in PyTorch 2.0.1 [49].
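Two of these settings lend themselves to a short sketch: how the closeness and period inputs could be sampled for p = q = 3, and how the step learning-rate schedule behaves. Both functions are illustrative assumptions. In particular, the sketch assumes the period input takes the frames at the same hour on each of the previous q days, and that epochs are counted from zero.

```python
import numpy as np

def build_inputs(series, t, p=3, q=3, period=24):
    """Return (closeness, periodic, target) for predicting frame t.

    series has shape (T, H, W) with one frame per hour.
    """
    closeness = np.stack([series[t - i] for i in range(p, 0, -1)])          # t-3, t-2, t-1
    periodic = np.stack([series[t - period * i] for i in range(q, 0, -1)])  # t-72, t-48, t-24
    return closeness, periodic, series[t]

def learning_rate(epoch, base=0.01, milestones=(50, 75), gamma=0.1):
    """Step schedule: multiply by gamma once each milestone epoch is reached."""
    return base * gamma ** sum(epoch >= m for m in milestones)

# Toy series in which frame i is filled with the value i.
series = np.arange(200, dtype=float)[:, None, None] * np.ones((1, 2, 2))
closeness, periodic, target = build_inputs(series, t=100)
```

In PyTorch, the same schedule is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR`, although the text does not say which scheduler was used.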

For evaluation, we used Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Symmetric Mean Absolute Percentage Error (SMAPE) as the performance metrics. RMSE is the square root of the average squared difference between the actual and predicted values, whereas MAE is the average absolute difference between them. SMAPE represents the relative error between the actual and predicted values and ranges from 0% to 200%. For all three metrics, a lower value indicates better prediction performance. Because RMSE increases significantly when large errors occur, we used it as the primary evaluation metric: it is sensitive to the large errors that can arise in unexpected situations such as traffic surges. We additionally evaluated MAE and SMAPE to examine the absolute and relative errors of the model. The metrics can be expressed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(y_{(i,j)} - \hat{y}_{(i,j)}\right)^{2}}, \tag{10}$$

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left|y_{(i,j)} - \hat{y}_{(i,j)}\right|, \tag{11}$$

$$\mathrm{SMAPE} = \frac{100\%}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\left|y_{(i,j)} - \hat{y}_{(i,j)}\right|}{\left(\left|y_{(i,j)}\right| + \left|\hat{y}_{(i,j)}\right|\right) / 2}, \tag{12}$$

where $y_{(i,j)}$ denotes the pixel of ground truth located at $(i, j)$, and $\hat{y}_{(i,j)}$ denotes the pixel of the model’s predicted value located at $(i, j)$. Both $y$ and $\hat{y}$ have a width W and height H. Each experiment was conducted 100 times with different random seeds, and the final results are presented as the average metrics across all runs. For the Trentino dataset, the experiments were conducted 10 times.
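Equations (10) to (12) translate directly into NumPy. One detail is an assumption of this sketch: cells where both the ground truth and the prediction are zero contribute zero to SMAPE, which avoids a division by zero that the formula leaves undefined.

```python
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def smape(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    num = np.abs(y - y_hat)
    denom = (np.abs(y) + np.abs(y_hat)) / 2.0
    # Assumption: a cell with y = y_hat = 0 counts as zero error.
    ratio = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    return float(100.0 * np.mean(ratio))

# Toy 2 x 2 grid with a single wrong cell (4 predicted as 2).
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.0, 2.0], [3.0, 2.0]]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), smape(y_true, y_pred))
```

For this toy grid the single error of 2 gives an RMSE of 1.0 and an MAE of 0.5, while SMAPE is 100/4 × (2/3), roughly 16.67%.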

5.2. Result

As shown in Table 2, the average performance improvement was 3.72% in the Call sub-dataset when using triplet attention combined with non-local attention using Gaussian kernels, 5.87% in the Internet sub-dataset when using non-local attention with the softmax function, and 2.84% in the SMS sub-dataset when using spatial attention with a kernel size of 5. In particular, the model with non-local attention using the softmax function shows relatively poor performance on the Call and SMS sub-datasets while achieving the best performance on the Internet sub-dataset. This indicates that the effectiveness of attention in ADNet may vary depending on the dataset.

As shown in Table 3, although our method performed worse than HSTNet by about 14.14% in the Internet sub-dataset, we found that, on average, 28.42% performance improvement occurred in the Call sub-dataset when using triplet attention with kernel size 5, and 6.62% in the SMS sub-dataset when using non-local attention with the softmax function. Unlike the experiment conducted with 20 × 20 grid, we found that ADNet using non-local attention with the softmax function performs the best on both the SMS and Internet sub-datasets among our methods. We think this is because modifying the sub-datasets from a pair of channels to a single channel altered their characteristics.

As shown in Table 4, although our method is worse than MVSTGN in the Internet sub-dataset, like in Table 3, we found that, on average, 10.00% performance improvement occurred in the Call sub-dataset and 8.61% in the SMS sub-dataset when using CBAM module with kernel size 3.

As shown in Table 5 and Table 6, our method performs noticeably worse than existing methods in SMAPE, unlike RMSE. This suggests that our approach has more difficulty predicting numerically small values as the proportion of such data increases. However, as shown in Table 6, SMAPE was improved compared with the baseline, which indicates that the attention module contributes to performance enhancement. Only partial results are shown for visibility. Full results, including MAE evaluation, are provided in the Supplementary Materials.

5.3. Efficiency of Attention Module

We compared the efficiency of our methods in terms of performance, FLOPs (Floating Point Operations), and the number of parameters. Since the range of RMSE varies across datasets, we used the average performance improvement rate compared to the baseline as the evaluation metric, and we added one so that the baseline performance serves as a reference point of one. The normalized performance can be expressed as follows:

$$1 + \frac{1}{N} \sum_{i}^{N} \frac{V_{i}^{base} - V_{i}}{V_{i}^{base}}, \tag{13}$$

where $i \in \{Call, Internet, SMS\}$, $N = 3$, and $V_{i}$ denotes the average RMSE of the target method in sub-dataset $i$. $V_{i}^{base}$ is the constant that denotes the average RMSE of STDenseNet in sub-dataset $i$.
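A worked example of Equation (13) in plain Python is shown below. The RMSE values are made-up numbers chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def normalized_performance(rmse, rmse_base):
    """Eq. (13): one plus the mean relative RMSE improvement over the baseline."""
    improvements = [
        (rmse_base[name] - rmse[name]) / rmse_base[name] for name in rmse_base
    ]
    return 1.0 + sum(improvements) / len(improvements)

# Hypothetical RMSE values for illustration only.
baseline = {"Call": 10.0, "Internet": 20.0, "SMS": 5.0}
method = {"Call": 9.0, "Internet": 19.0, "SMS": 5.0}
score = normalized_performance(method, baseline)
```

Here the relative improvements are 0.10, 0.05, and 0.00, so the score is 1 + 0.15/3 = 1.05. A method identical to the baseline scores exactly 1, and values below 1 indicate a regression.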

Figure 4 shows that most attention types contribute to better overall performance and that performance generally tends to improve as FLOPs increase. However, methods with significantly larger FLOPs tend to yield lower performance. This suggests that lightweight attention modules can provide a more efficient trade-off between accuracy and complexity, while overly complex attention modules do not necessarily lead to meaningful performance gains. In Table 7, the performance of the non-local attention using the softmax function is the best; however, this method is not efficient since it incurs high costs in FLOPs and parameters. When we changed the softmax function to the Gaussian kernel [46], although the number of model parameters increased, we could reduce the FLOPs and achieve a high-ranking overall performance. When we used triplet attention, FLOPs and the number of parameters did not increase significantly while the model achieved better performance than the baseline, making it an efficient approach. Also, when using an attention mechanism with a kernel size of 7, the performance tends to be higher than with a kernel size of 3 but lower than with a kernel size of 5. This indicates that while capturing spatial features between more distant pixels contributes to performance improvement, increasing the kernel size also includes more irrelevant pixels, which may hinder effective feature extraction.

5.4. Position of Attention Module

We conducted an ablation study to evaluate the impact of the position of the attention module. For each specified position and each sub-dataset, we report the performance of the attention mechanism that achieved the highest performance, and we analyze the tendency based on positional differences. This study shows that the model performs best when the attention module is placed after the last dense block (Table 8). We attribute this result to the DenseNet architecture used for feature extraction. When the attention module is applied after feature extraction, it computes attention scores based on the final feature map, which is computed from the input and the outputs of the hidden layers. In contrast, when the attention layer is placed before or during feature extraction, the feature map used for attention score computation contains relatively less information, potentially leading to lower performance. We also observe that inserting the attention layer between dense blocks in feature extraction resulted in lower performance than placing it before feature extraction. This phenomenon may also be explained by the DenseNet architecture. As the dense blocks go deeper, the number of feature maps that each block receives increases. These inputs consist of both feature maps processed by the attention module and feature maps that bypass it via skip connections. The entanglement of feature maps from different levels may harm the effectiveness of the attention mechanism.

6. Conclusions

In this paper, we proposed ADNet, an efficient model for network traffic prediction. We analyzed the temporal and structural characteristics of the dataset and enhanced the model’s ability to capture spatial features by incorporating an attention module, which achieved the best performance when applied after feature extraction. This suggests that the attention module effectively captures long-term dependencies by relating distant local features. ADNet outperformed the previous methods included in our experiments, and its effectiveness was examined through ablation studies and analyses. We also observed that the optimal type of attention varies depending on the specific task within the dataset. In future work, integrating this approach with an automated selection algorithm may further improve performance.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/signals6020029/s1. Table S1. MAE comparison of STDenseNet with different attentions in 20 × 20 grid. Table S2. SMAPE (%) comparison of STDenseNet with different attentions in 20 × 20 grid. Table S3. MAE comparison of STDenseNet with different attentions in 100 × 100 grid. Table S4. SMAPE (%) comparison of STDenseNet with different attentions in 100 × 100 grid. Table S5. MAE comparison of STDenseNet with different attentions in Trentino dataset. Table S6. SMAPE (%) comparison of STDenseNet with different attentions in Trentino dataset. Table S7. RMSE and MAE comparison of traffic prediction models with different attentions in Milan dataset. References [50,51,52,53,54,55] are cited in the Supplementary Materials.

Author Contributions

Conceptualization, M.O. and S.-H.B.; Methodology, M.O.; Software, M.O.; Validation, J.I.; Formal analysis, S.O. and J.I.; Investigation, M.O.; Data curation, S.O.; Writing—original draft, M.O.; Writing—review and editing, S.O., J.I., M.K., J.-S.K., J.-Y.P. and N.-R.Y.; Supervision, S.-H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Korea Research Institute for Defense Technology (KRIT) Grant funded by the Defense Acquisition Program Administration (DAPA) (KRIT-CT-22-077, Development of integrated communication terminal and network technology for battlefield adaptive multi-layer communication).

Data Availability Statement

The original dataset used in this study is openly available in the Harvard Dataverse repository at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV and https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QLCABU. A modified version of the Milan dataset is available in the GitHub repository at https://github.com/chuanting/STDenseNet (accessed on 27 April 2025).

Conflicts of Interest

Authors Myungho Kim, Joung-Sik Kim, Ji-Yeon Park, and Na-Rae Yi were employed by the company Hanwha. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Park, Y.; Park, A.; Kang, S.; Kim, Y.; Ryu, S. Trends on Development of Network Capacity in LTE-Advanced to Support Increasing Mobile Data Traffic. Electron. Telecommun. Trends 2012, 27, 122–135.
2. Le, D.H.; Tran, H.A.; Souihi, S.; Mellouk, A. An AI-based traffic matrix prediction solution for software-defined network. In Proceedings of the ICC 2021—IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–6.
3. Zhu, Y.; Wang, S. Joint traffic prediction and base station sleeping for energy saving in cellular networks. In Proceedings of the ICC 2021—IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–6.
4. Wu, Q.; Chen, X.; Zhou, Z.; Chen, L.; Zhang, J. Deep reinforcement learning with spatio-temporal traffic forecasting for data-driven base station sleep control. IEEE/ACM Trans. Netw. 2021, 29, 935–948.
5. Lin, J.; Chen, Y.; Zheng, H.; Ding, M.; Cheng, P.; Hanzo, L. A data-driven base station sleeping strategy based on traffic prediction. IEEE Trans. Netw. Sci. Eng. 2021, 11, 5627–5643.
6. Saxena, N.; Sahu, B.J.; Han, Y.S. Traffic-aware energy optimization in green LTE cellular systems. IEEE Commun. Lett. 2013, 18, 38–41.
7. Liu, B.; Meng, F.; Zhao, Y.; Qi, X.; Lu, B.; Yang, K.; Yan, X. A linear regression-based prediction method to traffic flow for low-power WAN with smart electric power allocations. In Simulation Tools and Techniques, Proceedings of the 11th International Conference, SIMUtools 2019, Chengdu, China, 8–10 July 2019; Song, H., Jiang, D., Eds.; Proceedings 11; Springer International Publishing: Cham, Switzerland, 2019; pp. 125–134.
8. Sun, H.; Liu, H.X.; Xiao, H.; He, R.R.; Ran, B. Use of local linear regression model for short-term traffic forecasting. Transp. Res. Rec. 2003, 1836, 143–150.
9. Rizwan, A.; Arshad, K.; Fioranelli, F.; Imran, A.; Imran, M.A. Mobile internet activity estimation and analysis at high granularity: SVR model approach. In Proceedings of the 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Bologna, Italy, 9–12 September 2018; pp. 1–7.
10. Cong, Y.; Wang, J.; Li, X. Traffic flow forecasting by a least squares support vector machine with a fruit fly optimization algorithm. Procedia Eng. 2016, 137, 59–68.
11. Sapankevych, N.I.; Sankar, R. Time series prediction using support vector machines: A survey. IEEE Comput. Intell. Mag. 2009, 4, 24–38.
12. Tian, Y.; Wei, C.; Xu, D. Traffic flow prediction based on stack autoencoder and long short-term memory network. In Proceedings of the 2020 IEEE 3rd International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 20–22 November 2020; pp. 385–388.
13. Wang, J.; Tang, J.; Xu, Z.; Wang, Y.; Xue, G.; Zhang, X.; Yang, D. Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach. In Proceedings of the IEEE INFOCOM 2017—IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017; pp. 1–9.
14. Zhang, C.; Zhang, H.; Yuan, D.; Zhang, M. Citywide cellular traffic prediction based on densely connected convolutional neural networks. IEEE Commun. Lett. 2018, 22, 1656–1659.
15. Zhang, D.; Liu, L.; Xie, C.; Yang, B.; Liu, Q. Citywide cellular traffic prediction based on a hybrid spatiotemporal network. Algorithms 2020, 13, 20.
16. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. Proc. AAAI Conf. Artif. Intell. 2017, 31.
17. Chen, C.; Li, K.; Teo, S.G.; Zou, X.; Li, K.; Zeng, Z. Citywide traffic flow prediction based on multiple gated spatio-temporal convolutional neural networks. ACM Trans. Knowl. Discov. Data TKDD 2020, 14, 1–23.
18. Sun, S.; Wu, H.; Xiang, L. City-wide traffic flow forecasting using a deep convolutional neural network. Sensors 2020, 20, 421.
19. Yang, D.; Li, S.; Peng, Z.; Wang, P.; Wang, J.; Yang, H. MF-CNN: Traffic flow prediction using convolutional neural network and multi-features fusion. IEICE Trans. Inf. Syst. 2019, 102, 1526–1536.
20. Zheng, C.; Fan, X.; Wen, C.; Chen, L.; Wang, C.; Li, J. DeepSTD: Mining spatio-temporal disturbances of multiple context factors for citywide traffic flow prediction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3744–3755.
21. Wang, Z.; Wong, V.W. Cellular traffic prediction using deep convolutional neural network with attention mechanism. In Proceedings of the ICC 2022—IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 2339–2344.
22. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
23. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
24. Shen, W.; Zhang, H.; Guo, S.; Zhang, C. Time-wise attention aided convolutional neural network for data-driven cellular traffic prediction. IEEE Wirel. Commun. Lett. 2021, 10, 1747–1751.
25. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
27. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
29. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27, 2204–2212.
30. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164.
31. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
33. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 3139–3148.
34. Yao, Y.; Gu, B.; Su, Z.; Guizani, M. MVSTGN: A multi-view spatial-temporal graph network for cellular traffic prediction. IEEE Trans. Mob. Comput. 2021, 22, 2837–2849.
35. Xiao, J.; Cong, Y.; Zhang, W.; Weng, W. A cellular traffic prediction method based on diffusion convolutional GRU and multi-head attention mechanism. Clust. Comput. 2025, 28, 125.
36. Rao, Z.; Xu, Y.; Pan, S.; Guo, J.; Yan, Y.; Wang, Z. Cellular traffic prediction: A deep learning method considering dynamic nonlocal spatial correlation, self-attention, and correlation of spatiotemporal feature fusion. IEEE Trans. Netw. Serv. Manag. 2022, 20, 426–440.
37. Ma, X.; Zheng, B.; Jiang, G.; Liu, L. Cellular network traffic prediction based on correlation ConvLSTM and self-attention network. IEEE Commun. Lett. 2023, 27, 1909–1912.
38. Su, J.; Cai, H.; Sheng, Z.; Liu, A.; Baz, A. Traffic prediction for 5G: A deep learning approach based on lightweight hybrid attention networks. Digit. Signal Process. 2024, 146, 104359.
39. Yang, H.; Jiang, J.; Zhao, Z.; Pan, R.; Tao, S. STVANet: A spatio-temporal visual attention framework with large kernel attention mechanism for citywide traffic dynamics prediction. Expert Syst. Appl. 2024, 254, 124466.
40. Chu, L.; Hou, Z.; Jiang, J.; Yang, J.; Zhang, Y. Spatial-temporal feature extraction and evaluation network for citywide traffic condition prediction. IEEE Trans. Intell. Veh. 2023.
41. Barlacchi, G.; De Nadai, M.; Larcher, R.; Casella, A.; Chitic, C.; Torrisi, G.; Antonelli, F.; Vespignani, A.; Pentland, A.; Lepri, B. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Sci. Data 2015, 2, 150055.
42. Telecom Italia. Telecommunications-SMS, Call, Internet-MI. Harvard Dataverse, 2015. V1. Available online: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV (accessed on 6 February 2025).
43. Wang, X.; Yang, K.; Wang, Z.; Feng, J.; Zhu, L.; Zhao, J.; Deng, C. Adaptive hybrid spatial-temporal graph neural network for cellular traffic prediction. In Proceedings of the ICC 2023-IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 4026–4032.
44. Zhao, N.; Ye, Z.; Pei, Y.; Liang, Y.C.; Niyato, D. Spatial-temporal attention-convolution network for citywide cellular traffic prediction. IEEE Commun. Lett. 2020, 24, 2532–2536.
45. Hu, Y.; Zhou, Y.; Song, J.; Xu, L.; Zhou, X. Citywide mobile traffic forecasting using spatial-temporal downsampling transformer neural networks. IEEE Trans. Netw. Serv. Manag. 2022, 20, 152–165.
46. Song, B.; Han, B.; Zhang, S.; Ding, J.; Hong, M. Unraveling the gradient descent dynamics of transformers. Adv. Neural Inf. Process. Syst. 2024, 37, 92317–92351.
47. Telecom Italia. Telecommunications-SMS, Call, Internet-TN. Harvard Dataverse, 2015. V1. Available online: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QLCABU (accessed on 24 May 2025).
48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
49. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Number 721. Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 8026–8037.
50. Campbell, J.Y.; Thompson, S.B. Predicting excess stock returns out of sample: Can anything beat the historical average? Rev. Financ. Stud. 2008, 21, 1509–1531.
51. Xu, F.; Lin, Y.; Huang, J.; Wu, D.; Shi, H.; Song, J.; Li, Y. Big data driven mobile traffic understanding and forecasting: A time series approach. IEEE Trans. Serv. Comput. 2016, 9, 796–805.
52. Jiang, W.; He, M.; Gu, W. Internet traffic prediction with distributed multi-agent learning. Appl. Syst. Innov. 2022, 5, 121.
53. Zhang, C.; Zhang, H.; Qiao, J.; Yuan, D.; Zhang, M. Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data. IEEE J. Sel. Areas Commun. 2019, 37, 1389–1401.
54. Zhang, W.; Zhu, F.; Lv, Y.; Tan, C.; Liu, W.; Zhang, X.; Wang, F.Y. AdapGL: An adaptive graph learning algorithm for traffic prediction based on spatiotemporal neural networks. Transp. Res. Part C Emerg. Technol. 2022, 139, 103659.
55. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, 27 January–1 February 2019; Volume 33, pp. 922–929.

Figure 1. Analysis of (Left) temporal characteristics and (Right) spatial characteristics of the Telecom Italia dataset (Internet).

Figure 2. The framework of the proposed method. The model receives p inputs to extract closeness features and q inputs to extract period features.

Figure 3. Details of the blocks in the feature extraction stage. (a) The first convolution block, (b) dense block, and (c) the last convolution block.

Figure 4. Comparison of FLOPs and parameters for different attention modules in the 100 × 100 grid.

Table 1. Configuration of the experimental setup.

Component	Configuration
epoch	100
learning rate	0.01
LR decay scale	0.1
LR decay step	50, 75 epochs
optimizer	Adam
dense block	3
input data set (h/day)	3/3
dataset	Milan, Trentino
Model	STDenseNet, HSTNet, MVSTGN
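
As a rough illustration of the learning-rate settings in Table 1, the sketch below assumes a step schedule in which the rate is multiplied by the decay scale once each listed epoch is reached, as in a multi-step scheduler. The table lists only the values, so the function name `learning_rate` and the exact epoch at which each decay takes effect are assumptions.

```python
from bisect import bisect_right

BASE_LR = 0.01            # "learning rate" in Table 1
DECAY_SCALE = 0.1         # "LR decay scale" in Table 1
DECAY_EPOCHS = (50, 75)   # "LR decay step" in Table 1
NUM_EPOCHS = 100          # "epoch" in Table 1


def learning_rate(epoch):
    """Learning rate at a 0-based epoch under the assumed step schedule."""
    # Count how many decay epochs have been reached (assumption: a decay
    # applies from the listed epoch onward).
    decays_applied = bisect_right(DECAY_EPOCHS, epoch)
    return BASE_LR * DECAY_SCALE ** decays_applied


for epoch in (0, 49, 50, 74, 75, NUM_EPOCHS - 1):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.4f}")
```

Under this reading, training runs at 0.01 for the first 50 epochs, at 0.001 for the next 25, and at 0.0001 for the remainder.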

Table 2. RMSE comparison of STDenseNet with different attentions in 20 × 20 grid. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table.

Attention	Call	Internet	SMS
STDenseNet (baseline)	16.17 (0.85)	172.48 (9.03)	26.76 (1.02)
MVSTGN	175.33 (17.07)	198.04 (25.21)	31.18 (6.69)
ADNet w. 3 × 3 Spatial	15.90 (1.11)	171.73 (10.34)	26.36 (1.44)
ADNet w. 5 × 5 Spatial	15.84 (1.06)	170.50 (7.37)	26.02 (1.33)
ADNet w. 7 × 7 Spatial	16.05 (1.07)	170.09 (6.53)	26.36 (1.21)
ADNet w. Channel	16.10 (1.24)	174.25 (15.10)	26.72 (1.58)
ADNet w. 3 × 3 CBAM	16.76 (1.79)	181.30 (19.40)	27.22 (2.65)
ADNet w. 5 × 5 CBAM	16.71 (1.86)	181.78 (21.23)	26.95 (2.94)
ADNet w. 7 × 7 CBAM	16.39 (1.54)	180.03 (20.11)	27.02 (2.73)
ADNet w. 3 × 3 Triplet	15.77 (0.76)	168.03 (6.14)	26.39 (0.97)
ADNet w. 5 × 5 Triplet	15.68 (0.76)	168.11 (5.91)	26.41 (1.04)
ADNet w. 7 × 7 Triplet	15.77 (0.89)	168.72 (6.77)	26.37 (1.01)
ADNet w. Non-local Softmax	16.24 (1.84)	162.92 (8.48)	27.49 (4.08)
ADNet w. Non-local Gaussian	15.75 (0.81)	167.40 (7.04)	26.32 (1.09)
ADNet w. Triplet Non-local Softmax	15.81 (1.49)	166.38 (10.14)	28.17 (3.35)
ADNet w. Triplet Non-local Gaussian	15.59 (0.73)	170.23 (7.59)	26.35 (0.95)
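
The headline improvements over STDenseNet can be recomputed from Table 2. The sketch below takes, for each sub-dataset, the lowest mean RMSE among the ADNet variants and expresses the gap to the baseline relative to that ADNet value. The choice of denominator is an inference rather than something the text states; it is used here because it reproduces the percentages quoted in the abstract. The variable and function names are illustrative.

```python
# Mean RMSE on the 20 x 20 grid, copied by hand from Table 2.
BASELINE_RMSE = {"Call": 16.17, "Internet": 172.48, "SMS": 26.76}

# Lowest mean RMSE among the ADNet rows of Table 2 for each sub-dataset.
BEST_ADNET_RMSE = {
    "Call": 15.59,       # Triplet Non-local Gaussian
    "Internet": 162.92,  # Non-local Softmax
    "SMS": 26.02,        # 5 x 5 Spatial
}


def improvement_percent(baseline, improved):
    """RMSE gap as a percentage of the improved (lower) RMSE."""
    return 100.0 * (baseline - improved) / improved


for name in BASELINE_RMSE:
    gain = improvement_percent(BASELINE_RMSE[name], BEST_ADNET_RMSE[name])
    print(f"{name}: {gain:.2f}%")
```

Under this convention the improvements come to about 3.72% (Call), 5.87% (Internet), and 2.84% (SMS).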

Table 3. RMSE comparison of STDenseNet with different attentions in 100 × 100 grid. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table. The results of STDenseNet and HSTNet are cited from [15].

Attention	Call	Internet	SMS
STDenseNet (baseline) [15]	17.10	80.51	27.49
HSTNet [15]	16.04	72.72	26.42
MVSTGN	42.28 (1.89)	97.73 (9.07)	28.49 (1.84)
ADNet w. 3 × 3 Spatial	13.78 (2.27)	92.29 (14.02)	27.76 (4.43)
ADNet w. 5 × 5 Spatial	13.10 (1.56)	89.48 (14.92)	26.67 (3.27)
ADNet w. 7 × 7 Spatial	13.28 (2.17)	90.14 (14.91)	27.17 (3.34)
ADNet w. Channel	12.99 (1.75)	111.01 (43.91)	26.89 (4.65)
ADNet w. 3 × 3 CBAM	14.03 (2.37)	100.49 (23.39)	26.52 (3.32)
ADNet w. 5 × 5 CBAM	14.06 (2.24)	99.59 (17.04)	26.75 (3.06)
ADNet w. 7 × 7 CBAM	14.17 (2.02)	107.15 (27.60)	26.39 (3.31)
ADNet w. 3 × 3 Triplet	12.88 (2.13)	86.47 (11.79)	26.67 (3.18)
ADNet w. 5 × 5 Triplet	12.49 (1.65)	85.09 (11.07)	26.06 (3.27)
ADNet w. 7 × 7 Triplet	12.63 (1.47)	86.72 (16.38)	25.83 (2.52)
ADNet w. Non-local Softmax	13.31 (3.57)	78.94 (17.39)	24.78 (4.51)
ADNet w. Non-local Gaussian	12.81 (3.34)	84.70 (14.49)	25.85 (3.06)
ADNet w. Triplet Non-local Softmax	15.68 (4.22)	95.36 (34.74)	28.19 (6.80)
ADNet w. Triplet Non-local Gaussian	12.89 (2.09)	87.66 (18.29)	26.35 (3.50)

Table 4. RMSE comparison of STDenseNet with different attentions in Trentino dataset. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table. “-” in the table indicates that we could not obtain results for that model due to implementation issues.

Attention	Call	Internet	SMS
STDenseNet (baseline)	21.18 (1.49)	177.84 (30.67)	37.30 (2.73)
MVSTGN	20.92 (1.32)	123.10 (4.65)	39.67 (1.70)
ADNet w. 3 × 3 Spatial	20.53 (1.56)	179.90 (26.17)	36.98 (2.84)
ADNet w. 5 × 5 Spatial	21.42 (1.19)	180.06 (16.51)	36.86 (2.26)
ADNet w. 7 × 7 Spatial	21.05 (1.26)	183.25 (26.30)	37.96 (1.84)
ADNet w. Channel	20.62 (3.31)	172.85 (37.87)	36.88 (2.69)
ADNet w. 3 × 3 CBAM	19.06 (5.01)	159.06 (49.11)	34.09 (5.89)
ADNet w. 5 × 5 CBAM	19.52 (3.49)	162.38 (44.28)	35.30 (2.35)
ADNet w. 7 × 7 CBAM	21.26 (1.26)	163.93 (35.41)	35.94 (3.21)
ADNet w. 3 × 3 Triplet	20.32 (2.25)	173.22 (29.90)	36.66 (3.30)
ADNet w. 5 × 5 Triplet	19.65 (2.01)	157.56 (32.89)	34.10 (4.00)
ADNet w. 7 × 7 Triplet	19.72 (2.47)	169.53 (26.79)	36.08 (2.49)
ADNet w. Non-local Softmax	-	-	-
ADNet w. Non-local Gaussian	21.10 (1.13)	180.25 (24.44)	37.28 (1.71)
ADNet w. Triplet Non-local Softmax	-	-	-
ADNet w. Triplet Non-local Gaussian	-	-	-

Table 5. SMAPE (%) comparison of STDenseNet with different attention in 100 × 100 grid. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table.

Attention	Call	Internet	SMS
MVSTGN	39.30 (7.05)	16.70 (4.90)	35.39 (2.71)
ADNet w. 5 × 5 Triplet	68.78 (8.83)	37.19 (10.50)	63.65 (9.70)
ADNet w. Non-local Softmax	76.24 (10.08)	34.09 (7.44)	73.41 (10.07)

Table 6. SMAPE (%) comparison of STDenseNet with different attention in Trentino dataset. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table.

Attention	Call	Internet	SMS
STDenseNet (baseline)	152.08 (10.66)	129.01 (11.33)	153.33 (12.36)
MVSTGN	54.72 (10.97)	12.13 (0.59)	39.90 (4.43)
ADNet w. 3 × 3 CBAM	147.17 (9.15)	128.61 (11.90)	143.21 (9.83)
ADNet w. 5 × 5 Triplet	144.41 (9.68)	123.55 (6.26)	139.47 (7.66)

Table 7. Comparison of FLOPs, MACs, and parameters for different attentions in 100 × 100 grid. The bold numbers indicate the best performances, while the underlined numbers indicate the second-best in the table.

Attention	Relative Performance	FLOPs (G)	Params (K)
STDenseNet	1.000	30.04	47.48
MVSTGN	0.416	33.72	284.19
ADNet w. 3 × 3 Spatial	1.013	30.06	47.52
ADNet w. 5 × 5 Spatial	1.051	30.10	47.59
ADNet w. 7 × 7 Spatial	1.038	30.17	47.68
ADNet w. Channel	0.961	30.13	48.72
ADNet w. 3 × 3 CBAM	0.989	30.15	52.44
ADNet w. 5 × 5 CBAM	0.989	30.19	52.70
ADNet w. 7 × 7 CBAM	0.960	30.25	53.08
ADNet w. 3 × 3 Triplet	1.068	30.10	47.61
ADNet w. 5 × 5 Triplet	1.088	30.20	47.80
ADNet w. 7 × 7 Triplet	1.082	30.34	48.09
ADNet w. Non-local Softmax	1.113	145.23	56.97
ADNet w. Non-local Gaussian	1.086	41.99	66.32
ADNet w. Triplet Non-local Softmax	0.958	311.13	97.67
ADNet w. Triplet Non-local Gaussian	1.066	77.02	147.32
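
One way to read Table 7 is to ask which configurations are not dominated, meaning that no other row combines FLOPs that are no higher with relative performance that is no lower. The sketch below computes this Pareto front from values copied by hand from the table. The function name `pareto_front` and the choice of FLOPs rather than parameters as the cost axis are illustrative assumptions, not an analysis performed in the paper.

```python
# (name, relative performance, GFLOPs) copied by hand from Table 7.
ROWS = [
    ("STDenseNet", 1.000, 30.04),
    ("MVSTGN", 0.416, 33.72),
    ("3x3 Spatial", 1.013, 30.06),
    ("5x5 Spatial", 1.051, 30.10),
    ("7x7 Spatial", 1.038, 30.17),
    ("Channel", 0.961, 30.13),
    ("3x3 CBAM", 0.989, 30.15),
    ("5x5 CBAM", 0.989, 30.19),
    ("7x7 CBAM", 0.960, 30.25),
    ("3x3 Triplet", 1.068, 30.10),
    ("5x5 Triplet", 1.088, 30.20),
    ("7x7 Triplet", 1.082, 30.34),
    ("Non-local Softmax", 1.113, 145.23),
    ("Non-local Gaussian", 1.086, 41.99),
    ("Triplet Non-local Softmax", 0.958, 311.13),
    ("Triplet Non-local Gaussian", 1.066, 77.02),
]


def pareto_front(rows):
    """Names of rows not dominated in (lower FLOPs, higher performance)."""
    front = []
    best_perf = float("-inf")
    # Visit rows from cheapest to most expensive; among rows with equal
    # FLOPs, visit the better-performing one first.
    for name, perf, flops in sorted(rows, key=lambda r: (r[2], -r[1])):
        # Keep a row only if it beats every cheaper-or-equal row seen so far.
        if perf > best_perf:
            front.append(name)
            best_perf = perf
    return front


print(pareto_front(ROWS))
```

With the Table 7 values, the front consists of the baseline, the 3 × 3 spatial attention, the 3 × 3 and 5 × 5 triplet attentions, and the softmax non-local attention.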

Table 8. Performance comparison of STDenseNet with different positions in 100 × 100 grid. The number in brackets denotes standard deviation. The bold numbers indicate the best performances in the table.

Position	Call	Internet	SMS
Before feature extraction	13.86 (2.05)	88.04 (14.23)	26.28 (3.58)
Between feature extraction	13.86 (1.76)	94.22 (19.06)	27.29 (3.07)
After feature extraction (ours)	12.51 (1.75)	84.31 (13.51)	25.80 (2.42)
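
To put the positional differences in Table 8 in perspective, the sketch below computes how much lower the mean RMSE is when attention follows feature extraction than at each alternative position, and prints the reported standard deviation of the "after" configuration next to each gap. Values are copied by hand from the table; the names `POSITIONS` and `gap_to_after` are illustrative.

```python
# (mean RMSE, standard deviation) per sub-dataset, copied from Table 8.
POSITIONS = {
    "before": {"Call": (13.86, 2.05), "Internet": (88.04, 14.23), "SMS": (26.28, 3.58)},
    "between": {"Call": (13.86, 1.76), "Internet": (94.22, 19.06), "SMS": (27.29, 3.07)},
    "after": {"Call": (12.51, 1.75), "Internet": (84.31, 13.51), "SMS": (25.80, 2.42)},
}


def gap_to_after(position, dataset):
    """Mean RMSE at `position` minus mean RMSE with attention placed after."""
    other_mean, _ = POSITIONS[position][dataset]
    after_mean, _ = POSITIONS["after"][dataset]
    return other_mean - after_mean


for position in ("before", "between"):
    for dataset in ("Call", "Internet", "SMS"):
        _, after_std = POSITIONS["after"][dataset]
        gap = gap_to_after(position, dataset)
        print(f"{position:7s} vs after, {dataset:8s}: "
              f"gap {gap:5.2f}, std of 'after' {after_std:5.2f}")
```

Every gap is positive, which matches the ordering reported in the table.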

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

