Article

Diffusion–Attention Traffic Generation: Traffic Generation Based on the Fusion of a Diffusion Model and a Self-Attention Mechanism

1 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
2 China Academy of Industrial Internet, Beijing 266101, China
3 College of Computer Science, Beijing Information Science and Technology University, Beijing 102206, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(10), 1977; https://doi.org/10.3390/electronics14101977
Submission received: 10 March 2025 / Revised: 1 April 2025 / Accepted: 17 April 2025 / Published: 13 May 2025

Abstract

Network traffic generation technology plays a critical role in network testing and security protection. It simulates various network traffic patterns, offering intrusion detection systems diverse samples. These samples enhance the system’s adaptability to emerging traffic types and improve detection capabilities. Additionally, it provides high-quality data for traffic analysis algorithms. These data help optimize model performance by increasing robustness and applicability. However, traditional generative models, such as GANs, have limited capacity for capturing temporal flow features. As a result, they produce suboptimal generation outcomes. To address this issue, we propose the diffusion–attention traffic generation (DATG) framework. This framework combines a diffusion model with a self-attention mechanism. This integration ensures more accurate simulation of temporal characteristics. The diffusion model’s progressive denoising process guarantees generation stability. Meanwhile, the self-attention mechanism captures global temporal dependencies in traffic sequences. Experimental results validate the superior performance of DATG: the synergistic optimization of the diffusion model and the self-attention mechanism reduces the JSD to 0.18, a 41.9% improvement over the GAN. Additionally, the time-dimension self-attention mechanism establishes cross-step dependencies, reducing the CRPS to 0.13, which is 31.6% lower than that of the GAN.

1. Introduction

Network traffic generation serves as a critical component in modern network research and applications, providing foundational support for network testing, traffic analysis, and security protection. Particularly in complex network environments, this technology simulates multi-scenario network behaviors that enable protocol optimization, performance evaluation, and attack detection through comprehensive behavioral emulation. Nevertheless, producing realistic, dynamic, and diverse traffic persists as a significant technical challenge.
Traditional approaches primarily depend on statistical models or generator-based architectures that assume predefined probability distributions of traffic characteristics. This assumption limits their ability to capture complex traffic patterns within advanced network environments. Although stochastic process-based methods can simulate enhanced network state complexity, they encounter two critical limitations: excessive computational costs and inadequate handling of high-dimensional data. Deep learning techniques have greatly advanced the field of traffic generation. In particular, deep generative models like Generative Adversarial Networks (GANs) [1] and Variational Autoencoders (VAEs) [2] can capture intricate traffic features and generate traffic. Despite their capabilities, these models still face challenges in generating high-dimensional network traffic data, including training instability, gradient vanishing, and difficulty in capturing global dependencies within traffic sequences.
To overcome the limitations of existing methods, this paper proposes a novel traffic generation approach based on the fusion of a diffusion model [3] and a self-attention [4] mechanism. The diffusion model generates data by simulating both the forward noise diffusion process and the reverse denoising process, producing results with significant realism and diversity. Compared to GANs, the diffusion model offers advantages in generation stability and the ability to capture complex features. The self-attention mechanism effectively captures global dependencies in complex data. Specifically, for network traffic data, the self-attention mechanism enhances modeling ability by dynamically computing the correlations between time steps in the sequence, improving the modeling of complex temporal behavior and compensating for the feature representation limitations of traditional generative methods.
This paper presents several significant contributions, as follows:
  • Traffic Generation Based on a Diffusion Model: This study employs a diffusion model for traffic generation, ensuring both the stability of the generation process and the diversity of the generated data through a stepwise generation approach. This approach provides an effective solution for the traffic generation task.
  • Integration of a Diffusion Model and a Self-Attention Mechanism: By integrating the self-attention mechanism with the diffusion model, the DATG introduces global dependency modeling at each step of the generation process, effectively capturing the complex temporal characteristics of traffic sequences and further enhancing the quality of the generated traffic.
  • Improvement in Traffic Quality and Diversity: Experimental validation confirms that the framework combining the diffusion model and self-attention mechanism demonstrates significant advantages over traditional methods in terms of dynamic variation, temporal consistency, and data diversity.
The subsequent sections of this paper are organized as follows: Section 2 reviews related work on traffic generation using deep learning methods, such as GANs. Section 3 (Prerequisite Knowledge) explains the models and evaluation metrics used in our work. Section 4 (Proposed Method) details the specific improvements proposed in our work. Section 5 (Experiment) presents the training results of the model. Finally, Section 6 (Conclusions) concludes this work.

2. Related Work

2.1. Traffic Generation Based on GAN

Deep learning techniques have significantly advanced the field of traffic generation, particularly through deep generative models such as GANs and VAEs. These models are capable of capturing complex flow features and generating traffic flows. After Goodfellow introduced Generative Adversarial Networks (GANs) in 2014, there was rapid development and application in fields such as image and text generation, which led to their natural adaptation for network flow dataset generation. By inputting network flow data into the GAN network and using adversarial training between the generator and the discriminator, realistic network flow data can be generated. Xiao et al. [5] proposed the AdvGAN model, which was the first to achieve the automatic generation of adversarial image samples by training a GAN network. As the technology developed, the research and application of adversarial samples gradually expanded into other fields, including network security. The approaches presented in [6,7] are similar to the AdvGAN model for black-box attacks. They rely on the core concept of training a local surrogate model that approximates the target model, thus achieving the goal of black-box attacks. However, this approach requires extensive interaction with the target model, and the generated adversarial samples show poor transferability. Ring et al. [8] designed three different preprocessing methods based on GAN to convert flow-based data into continuous values and to generate flow-based network traffic using the CIDDS-001 dataset. The dataset was created in a virtual environment using OpenStack, providing rich parameters such as IP addresses, port numbers, TCP flags, and more. Wu et al. [9] proposed a framework based on Deep Reinforcement Learning (DRL) to generate adversarial flows that deceive intrusion detection models. The DRL agent introduces perturbations into malicious flows. A critical limitation persists in current flow generation methods: they fail to characterize the individual packets that comprise network flows. While existing approaches successfully synthesize flow-level statistics (e.g., total packet counts and byte volumes), they systematically ignore packet-level granularity, which is manifested in size distributions and temporal spacing patterns. This coarse-grained synthesis produces artificial traffic patterns that are detectable by modern monitoring systems employing packet-level inspection mechanisms, thereby exposing critical deception vulnerabilities in security testing scenarios.
Dowoo et al. [10] proposed PcapGAN for generating PCAP files, trained on network attack and normal data for intrusion detection. PcapGAN generates packet data that can be analyzed by network packet analysis tools. The data generation process is transparent, allowing for visual assessment of the quality of the generated data, similar to image data. The results showed an improvement in the detection rate of intrusion detection algorithms using these generated data. However, Dowoo et al. only evaluated the accuracy of each GAN discriminator model indirectly and did not provide a comprehensive, quantitative evaluation of the generated packet data. Sina et al. [11] introduced a GAN tunnel for generating traffic that simulates a decoy application. This tunnel encapsulates real user traffic within GAN-generated traffic, protecting the identity of the application generating the network traffic from classification by an internet traffic classifier. The GAN tunnel uses WGAN to generate traffic features with statistical distributions that closely resemble those of actual network traffic, thus masking the statistical properties of user application-generated traffic. Additionally, users can configure WGAN to generate traffic for selected decoy applications. Du et al. [12] utilized dynamic word embeddings and contrastive learning to generate background network traffic in network testbeds. This method intelligently captures the spatio-temporal features of different coarse-grained traffic for characterization. By learning the feature distribution from small samples, the contrastive learning method SimCSE generates high-quality and large-scale background network traffic.
Although the aforementioned methods have improved the quality and realism of traffic generation to some extent, they still have certain shortcomings. For example, deep generative models such as GANs and VAEs face issues like instability during the training process, vanishing gradients, and difficulty in capturing global dependencies in traffic sequences when generating high-dimensional network traffic data. To address the shortcomings of existing methods, this paper proposes a traffic generation approach that fuses the diffusion model with a self-attention mechanism. The diffusion model generates results with significant realism and diversity by simulating both the forward noise diffusion and reverse denoising processes of the data. The self-attention mechanism, on the other hand, captures global dependencies when processing complex data, compensating for the limitations of traditional generative methods in feature representation.

2.2. Emergence of Diffusion Model

With the development of Denoising Diffusion Probabilistic Models (DDPM), diffusion generative models exhibit significant advantages in representation ability, generalization ability, and flexibility. These features have led to the widespread application of diffusion models in many fields, including computer vision, natural language processing, and time series analysis. SRDiff [13] addresses the over-smoothing, mode collapse, and high memory consumption of existing single-image super-resolution (SISR) methods by combining diffusion generative modeling: a Markov chain transforms the high-resolution (HR) image into a simple latent distribution, and the reverse process then generates super-resolution (SR) predictions. The low-resolution (LR) information produced by an LR encoder serves as conditional input in this process, guiding the step-by-step denoising toward the high-resolution image. The UNIT-DDPM [14] model applies DDPM to the unpaired image-to-image translation task: it introduces a source data domain and a target data domain, forms a joint distribution by matching the denoising scores of the two domains conditioned on each other, updates this distribution as a Markov chain, and finally generates samples by denoising through the Markov chain Monte Carlo method. Wang et al. [15] also applied the DDPM model for semantic image synthesis. They provide the noisy image as input to the encoder of the U-Net structure, while the semantic layout is passed to the decoder through a multilayered spatial adaptive normalization operator. The result is further improved by introducing a classifier-free guided sampling strategy, which enhances both sampling quality and semantic interpretability.
Diffusion-LM [16] proposes a new non-autoregressive language model based on continuous diffusion, which iteratively denoises Gaussian noise vectors into word vectors, thereby producing continuous, hierarchically structured latent relationships between the vectors. DiffuSeq [17] adds an embedding layer that maps discrete text into a continuous representation space; in the reverse process, the model is trained to approximate the distribution of text sequences. For time series originally interpolated using autoregressive models, CSDI [18] replaces the autoregressive model with a conditional score-based diffusion model for learning conditional distributions. This model takes observations as conditional inputs, utilizing the information in the observations for the denoising process. A self-supervised approach is applied in the subsequent training process, separating the observations into conditional information and interpolation targets to compensate for the lack of ground-truth values. The diffusion model offers stability and diversity in generation. Compared with traditional methods, it demonstrates a stronger capability in processing high-dimensional data and complex features, making it particularly suitable for traffic generation tasks that require consideration of global dependencies and complex temporal features.

3. Prerequisite Knowledge

3.1. Diffusion Model

DDPM operates through sequential noise data transformation, implementing a two-stage generation process: forward diffusion and reverse reconstruction. During the initial phase, Markov chain operations systematically degrade input data into Gaussian noise via iterative noise injection across discrete timesteps. The subsequent reconstruction phase employs learned transition kernels to iteratively recover data structures from noise, ultimately synthesizing samples that preserve the statistical fidelity of the training distributions.
The forward process progressively transforms the input real image, $x_0$, into a pure Gaussian noise image, $x_T$, by adding noise to the initial image over $T$ steps. At each step of the noise addition process, Gaussian noise is added to $x_{t-1}$ to generate a new hidden variable, $x_t$. The noise addition process from step $t-1$ to step $t$ can be represented by a Gaussian distribution as follows:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \quad (1)

where $\sqrt{1-\beta_t}\,x_{t-1}$ denotes the mean of the Gaussian distribution, $\beta_t I$ denotes the variance of the Gaussian distribution, $\beta_t$ is a hyperparameter that increases gradually with $t$, and $I$ denotes the identity matrix with the same dimension as the input $x_0$ samples. The forward process of the diffusion model can be represented as a Markov chain from $t = 1$ to $t = T$ as follows:

q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \quad (2)
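As a concrete illustration of Equations (1) and (2), the following minimal PyTorch sketch applies the forward noising chain to a batch of traffic feature tensors; the schedule length, the β range, and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

import torch

# Illustrative linear noise schedule (assumed values, not necessarily the paper's settings).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_diffusion(x0: torch.Tensor, betas: torch.Tensor) -> torch.Tensor:
    """Run the forward Markov chain of Equation (2) by repeatedly applying Equation (1)."""
    x = x0
    for beta_t in betas:
        noise = torch.randn_like(x)
        # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
        x = torch.sqrt(1.0 - beta_t) * x + torch.sqrt(beta_t) * noise
    return x  # approximately pure Gaussian noise after T steps

# Example: a hypothetical batch of 8 traffic sequences, 32 timesteps, 4 features each.
x0 = torch.randn(8, 32, 4)
x_T = forward_diffusion(x0, betas)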
The reverse process of the DDPM model is also a Markov chain, where the image is generated by learning to predict the variance and mean of the Gaussian distribution during the reverse diffusion process through a neural network that gradually denoises the noisy image.
In network traffic generation, the diffusion model generates traffic data with a high degree of realism and diversity mainly through its denoising process. In traditional traffic generation methods, many models rely on predefined probability distributions and statistical models, which often have limitations in modeling complex network environments. The advantage of the diffusion model is that it is able to introduce dynamic noise into the data generation process, thus making the generated data more closely resemble real network traffic characteristics.
The diffusion model ensures the diversity and stability of the generated results through its gradual generation process. When generating network traffic, the model first gradually adds noise to the real traffic data until the data become pure noise. This process simulates the real-world scenario where the data are constantly perturbed. In the reverse denoising process, the model gradually removes the noise and recovers the original traffic data. This step-by-step feature effectively avoids the mode collapse problem that is common in some traditional generation methods, especially in the generation of complex traffic patterns. The diffusion model can be adjusted and optimized through multiple steps to generate data.
The diffusion model can also perform adaptive generation according to different network scenarios. For example, in the generation of different types of network traffic, such as video streaming, file transfer, etc., the diffusion model dynamically adjusts the noise level and denoising strategy during the generation process according to the characteristics of each type of traffic, generating traffic data that meet the needs of different scenarios. This capability makes the diffusion model show more flexibility and adaptability than traditional methods in producing network traffic with high variability and dynamics. Through the adaptive noise addition process, the diffusion model ensures that the generated data are not only advantageous in terms of diversity but also ensures the temporal consistency of the data. This is especially important in network traffic generation, because traffic data usually contain complex temporal dependencies and traffic patterns, and traditional models often have difficulty handling these long-term dependencies during generation. The gradual denoising process of the diffusion model better preserves the temporal characteristics of the data, allowing the generated traffic data to accurately reflect the evolution and changes in network behavior.

3.2. Self-Attention Mechanism

The self-attention mechanism is a powerful model component that effectively captures long-range dependencies and contextual relationships in sequential data. It was initially widely used in the field of natural language processing, for tasks such as text classification and machine translation, significantly enhancing model performance. As illustrated in Figure 1, the self-attention mechanism passes the input sequence through a linear mapping layer with learnable transformation matrices to obtain the query matrix Q, the key matrix K, and the value matrix V. The attention feature map is then obtained through Equation (3). The similarity between features is calculated by performing a dot product between Q and K; the higher the similarity, the greater the product, indicating a higher degree of attention on that feature. A scaling factor, $\sqrt{d_k}$, is applied after the matrix multiplication to prevent large dot products from pushing Softmax into regions with vanishing gradients. The larger the correlation, the greater the weight assigned to the corresponding V. Finally, all obtained V values are weighted and summed to produce the desired output information.
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (3)
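To make Equation (3) concrete, a minimal PyTorch sketch of scaled dot-product self-attention is given below; the batch size, sequence length, and feature dimensions are assumptions chosen only for illustration.

import math
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, W_q: torch.Tensor, W_k: torch.Tensor, W_v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention as in Equation (3).
    x: (batch, seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) learnable projections."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.size(-1)
    # Pairwise similarity between timesteps, scaled by sqrt(d_k) to keep Softmax gradients stable.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ V  # weighted sum of the value vectors

# Hypothetical dimensions: 8 sequences, 32 timesteps, 16 features.
x = torch.randn(8, 32, 16)
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)  # shape (8, 32, 16)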
The application of the self-attention mechanism can significantly improve the ability of generative models to model complex temporal dependencies. Network traffic data usually contain many different communication protocols and behavioral patterns, and the dependencies between different time points are complex and diverse. Traditional neural network models, such as convolutional neural networks (CNNs) [19] and long short-term memory networks (LSTMs) [20], are capable of handling some temporal features but have limitations in dealing with long-distance dependencies and capturing global information. The self-attention mechanism, on the other hand, captures a wider range of temporal dependencies by calculating correlations between individual data points and dynamically assigning varying weights to each point. Combining convolutional neural networks with the self-attention mechanism has also emerged as a promising direction. CvT [21] replaces the original linear mapping layer in the self-attention mechanism: the input sequence passes through three convolutional layers to obtain the Q, K, and V matrices, with strided convolutions reducing the computational cost of obtaining the tokens. CPVT [22] and the CSwin Transformer [23] use convolution-based positional encoding to reduce the computational complexity of the model, especially in downstream tasks. CoAtNet [24] uses MBConv [25] in the first two stages of the model to obtain the low-level information of the image and discusses in detail the optimal combination of a convolutional module and a Transformer block, showing excellent modeling performance.
In network traffic generation, the temporal relationships of traffic data are very complex, and traditional methods can often only capture local temporal features without effectively modeling global dependencies. The self-attention mechanism is able to dynamically capture the dependencies between distant data points by calculating the correlation between individual elements in the sequence. By weighting the influence of each data point, the self-attention mechanism can effectively capture these long-distance dependencies and improve the accuracy of generated traffic data.
The self-attention mechanism introduces global information at each generation step, thereby improving the temporal consistency of the generated data. In network traffic generation, temporal consistency is crucial for accurately modeling network behavior. For instance, there are often strong temporal relationships between requests and responses, which the self-attention mechanism captures by considering the entire data sequence. This enhances the consistency of the generated traffic over time. Additionally, the self-attention mechanism allows the generative model to identify appropriate dependencies in different traffic types and adjust its computations based on each type’s characteristics, generating network traffic that aligns with real-world scenarios.

3.3. Evaluation Metrics

3.3.1. JSD

Jensen–Shannon Divergence (JSD) is a measure of the similarity between two probability distributions. It is the symmetric version of the Kullback–Leibler (KL) divergence and is commonly used to assess the difference between generated data and real data. The value of JSD ranges from 0 to 1, with a value closer to 0 indicating that the two distributions are more similar, as defined below:
\mathrm{JSD}(P \parallel Q) = \frac{\mathrm{KL}(P \parallel M) + \mathrm{KL}(Q \parallel M)}{2} \quad (4)
where $P$ denotes the distribution of real data, $Q$ denotes the distribution of generated data, and $M = \frac{P+Q}{2}$ is their mixture distribution.
While traditional traffic generation models often suffer from the mode collapse problem, which causes the generated data distribution to deviate from real scenarios, the DATG framework proposed in this paper significantly alleviates this problem by fusing the diffusion model with the self-attention mechanism. The core value of JSD lies in its symmetry and boundedness: by symmetrizing the KL divergence and introducing the mixture distribution $M = \frac{P+Q}{2}$, JSD eliminates the bias due to the order of the distributions in the evaluation process, and at the same time, its value is bounded within [0, 1], making cross-model comparisons more intuitive.
In this paper, JSD is used to measure the distance between the generated traffic data from different models and the real traffic data, serving as an indicator of the authenticity of the data generated by these models. A shorter distance indicates that the generated data are more authentic.
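As a sketch of how JSD can be computed in practice for a single traffic feature, the snippet below bins real and generated samples into histograms and uses SciPy's Jensen–Shannon distance; the bin count and the synthetic packet-size data are assumptions for illustration only.

import numpy as np
from scipy.spatial.distance import jensenshannon

def traffic_jsd(real: np.ndarray, generated: np.ndarray, bins: int = 50) -> float:
    """Estimate JSD between real and generated samples of one traffic feature."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    # Bin both samples over a common range to obtain the distributions P and Q.
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi))
    # SciPy returns the Jensen-Shannon distance (sqrt of the divergence); square it for JSD in [0, 1].
    return jensenshannon(p, q, base=2) ** 2

# Illustrative packet-size samples only; not drawn from any real dataset.
real = np.random.lognormal(mean=6.0, sigma=0.5, size=10_000)
fake = np.random.lognormal(mean=6.1, sigma=0.6, size=10_000)
print(traffic_jsd(real, fake))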

3.3.2. CRPS

The Continuous Ranked Probability Score (CRPS) is used to assess the consistency of the generated flow data in terms of probability distribution, particularly whether the generated data retain probability distribution properties similar to those of the real data. This metric measures the correlation between the generated traffic and the real traffic in terms of probability distribution, usually by measuring the difference in probability distribution at different points in time. The mathematical definition of CRPS is the integral of the squared difference between the cumulative distribution function (CDF) of the generated flow sequence and the step function of the real observation. For the predictive distribution, $F$, of the generated flow sequence and the true observation, $y$, it is defined as follows:

\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \ge y\} \right)^2 dx \quad (5)

where $\mathbb{1}\{x \ge y\}$ is an indicator function that takes the value 1 when $x \ge y$. In practical calculations, if the generated data are a deterministic sample, $\hat{y}$, the score can be approximated using the empirical distribution as follows:

\mathrm{CRPS} \approx \frac{1}{N}\sum_{i=1}^{N} \left| X_i - Y_i \right| \quad (6)

where $N$ is the total number of sample points; $X_i$ is the $i$-th data point in the real traffic sequence; $Y_i$ is the $i$-th data point in the generated traffic sequence; and $|X_i - Y_i|$ is the absolute difference between the generated data and the real data. The range of the CRPS value is from 0 to 1; a value closer to 0 indicates that the generated data are closer to the real data, and a value closer to 1 indicates that the generated data differ more from the real data.
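Under the deterministic-sample approximation of Equation (6), CRPS reduces to the mean absolute difference between paired points of the real and generated sequences; the sketch below assumes the two sequences are already aligned point by point, and the sample data are purely illustrative.

import numpy as np

def empirical_crps(real_seq: np.ndarray, generated_seq: np.ndarray) -> float:
    """Deterministic-sample approximation of CRPS from Equation (6)."""
    assert real_seq.shape == generated_seq.shape
    # Mean absolute difference between paired real and generated points.
    return float(np.mean(np.abs(real_seq - generated_seq)))

# Hypothetical per-second byte counts over a 60-step window.
real = np.random.poisson(lam=1000, size=60).astype(float)
fake = real + np.random.normal(scale=50.0, size=60)
print(empirical_crps(real, fake))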
CRPS directly reflects the ability to reproduce dynamic properties in the time dimension by quantifying the difference between generated data and real data in terms of the cumulative distribution function. Traditional time series generative models (e.g., RNN [26]) are limited by local receptive fields, making it difficult to model sudden traffic spikes or long-period fluctuations. DATG establishes inter-timestep dependencies via the self-attention mechanism, enabling the generated traffic to closely approximate real-world patterns in continuous dimensions such as bit-rate variations and delay fluctuations. The advantages of CRPS lie in its strict propriety and multimodal compatibility: as a strictly proper scoring rule, it attains its minimum score only when the generated and true distributions match exactly, thereby discouraging overfitting or conservative prediction biases. Additionally, its integral formulation accommodates both discrete events (e.g., packet arrival times) and continuous variables (e.g., transmission rates).
In this paper, JSD and CRPS are chosen as evaluation metrics based on their complementary advantages in the traffic generation task. The symmetry and multi-granularity windowed computation of JSD effectively quantify the global distribution matching of packet sizes, flow durations, and other features, and its robustness to zero-valued distributions significantly outperforms that of the traditional KL divergence, making it particularly suitable for complex scenarios in which bursty traffic alternates with silent periods. CRPS, in turn, captures protocol interaction logic and long- and short-term timing dependencies through cross-layer attention matrix similarity analysis.

4. Proposed Method

4.1. Overall Architecture

The application of diffusion modeling provides stability and diversity in flow generation. The generation process of the diffusion model consists of two main stages: forward diffusion and backward denoising. In the forward diffusion process, noise is gradually added to the original flow data until they become pure noise. This process introduces noise step by step, providing stability to the generation process and ensuring that the generated flow data have some degree of explorability and diversity. In the reverse denoising process, the model gradually removes the noise based on the added noise data, restores the data structure, and generates samples that resemble the real flow. This stepwise generation process of the diffusion model effectively ensures the high quality of the generated traffic data. However, capturing the global temporal dependencies in the flow data remains challenging when relying solely on the diffusion model, which is a shortcoming of the traditional generation model.
To address this problem, we introduce the self-attention mechanism. The self-attention mechanism captures the global dependencies between individual data points in the input sequence, which is especially important for modeling complex time series data. During the flow generation process, the self-attention mechanism weights the data at each moment, calculates the correlations between time steps, and weights the data based on these correlations to perform a weighted sum. This effectively captures the correlations between different time points, ensuring the consistency of the generated traffic data in terms of the time series. The overall architecture of the DATG framework is illustrated in Figure 2.
The overall architecture of the DATG framework consists of three main parts: data preprocessing, diffusion model initialization, and the diffusion process combined with a self-attention mechanism. The data preprocessing part is responsible for extracting features from raw traffic data and performing cleaning and normalization. The diffusion model initialization part sets the noise distribution and parameters and generates the initial noise data. The diffusion process combined with the self-attention mechanism, schematically illustrated in Figure 3, generates high-quality traffic data through stepwise denoising and global dependency modeling.

4.2. Diffusion Model and Self-Attention Fusion Mechanism

4.2.1. Data Preprocessing

Data collection and filtering: A large amount of raw traffic data is collected from the network environment and stored in pcap file format. These data may contain various types of application traffic, such as web browsing, video streaming, file transfer and so on. The collected pcap files are preliminarily screened to remove traffic that clearly does not meet the research objectives.
Data cleansing: Deep cleansing is performed on the screened pcap files. Specific operations include removing all broadcast packets, as broadcast packets usually do not contain specific peer-to-peer communication information and contribute minimally to the traffic generation task; removing corrupted packets, which may have incomplete data or incorrect formatting due to transmission errors, etc., and can affect the accuracy and stability of subsequent processing; and removing duplicated packets as required to avoid data redundancy that could interfere with the model training process.
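The cleansing steps described above can be sketched with Scapy as follows; the broadcast and duplicate filters mirror the operations listed in this subsection, the corrupted-packet check is omitted for brevity, and the file names are hypothetical.

from scapy.all import rdpcap, wrpcap
from scapy.layers.l2 import Ether

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

def clean_pcap(in_path: str, out_path: str) -> None:
    """Remove broadcast frames and exact duplicate packets from a capture."""
    packets = rdpcap(in_path)
    seen = set()
    kept = []
    for pkt in packets:
        # Drop Ethernet broadcast frames: they carry no peer-to-peer communication information.
        if pkt.haslayer(Ether) and pkt[Ether].dst.lower() == BROADCAST_MAC:
            continue
        # Drop exact duplicates by comparing raw packet bytes.
        raw = bytes(pkt)
        if raw in seen:
            continue
        seen.add(raw)
        kept.append(pkt)
    wrpcap(out_path, kept)

# clean_pcap("raw_capture.pcap", "cleaned_capture.pcap")  # hypothetical file names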

4.2.2. Diffusion Model Initialization

Noise distribution selection and parameter setting: An in-depth analysis of the characteristics of the flow data is conducted to select the appropriate noise distribution. Gaussian noise is chosen to better simulate random fluctuations in flow data. After determining the noise distribution, parameter configuration involves setting the variance and mean values. The variance magnitude directly controls noise intensity: higher variance values enhance exploration capability and data diversity but risk generating distributionally deviant samples, and lower variance values improve generation stability while producing more realistic data with potential diversity reduction. Optimal noise parameters are determined through experimental validation or prior knowledge.
Initial Noise Data Generation and Dimension Matching: Initial noise data are generated based on the characteristic dimensions and sizes of the target traffic data. The traffic data contain multiple feature dimensions, such as packet size, packet arrival time, flow duration, etc., and the data scale of each dimension consists of a certain number of sample points, which generates the noise matrix of the corresponding dimensions and scales, serving as the input to the diffusion model.
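A minimal sketch of this initialization is shown below; the flow count, sequence length, feature set, and noise standard deviation are hypothetical values used only to illustrate dimension matching.

import torch

# Hypothetical target shape: 256 flows, 32 timesteps per flow, 3 features
# (e.g., packet size, inter-arrival time, cumulative flow duration).
num_flows, seq_len, num_features = 256, 32, 3
sigma = 1.0  # assumed noise standard deviation; tuned experimentally in practice

# Zero-mean Gaussian noise matching the target dimensions, used as the
# starting point x_T of the reverse denoising process.
x_T = sigma * torch.randn(num_flows, seq_len, num_features)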

4.2.3. Diffusion Processes Combined with a Self-Attention Mechanism

Denoising step size and iteration control: The denoising step size governs noise removal magnitude per iteration. Larger steps accelerate convergence but risk inducing data oscillations or noise overfitting. Smaller steps ensure smoother generation while prolonging convergence time and increasing computational overhead. The iteration count critically impacts data quality and diversity: excessive iterations oversmooth outputs by eroding real flow details, while insufficient iterations retain residual noise, compromising generation efficacy.
Global dependency modeling: The self-attention mechanism calculates the association weights between each data point after denoising at each step. For each time point or each feature dimension data point in the flow data, the self-attention mechanism considers its relationship with all other data points and obtains the weight matrix by calculating the similarity metric. These weights are then applied to perform a weighted summation of the data points, thereby integrating global information into each local element.
In this study, the process of combining the diffusion model with the self-attention mechanism is structured into four main steps to ensure that the generated network traffic data exhibit high quality, diversity, and temporal consistency. The specific combining process is as follows:
Step 1. 
The diffusion model denoising process is conducted as follows:
First, the diffusion model ensures the stability of the generation process by gradually denoising the data. We treat the original network traffic data as high-dimensional features, progressively denoising it through the diffusion model until it reaches pure noise. This process simulates the structural degradation from ordered states to chaotic noise, enabling the model to produce traffic data from stochastic initial conditions.
In the reverse denoising process, we use a trained denoising network to gradually remove noise and restore the structure of the real traffic data. This denoising process is not a simple reduction, but rather a gradual removal of noise through each round of network prediction, allowing the generated traffic data to more closely match the distribution and timing characteristics of the original traffic. Each denoising step carries an approximation of the real data structure, and as the denoising process continues, the generated data gradually recover a structure similar to the original traffic data from the random noise while retaining the timing characteristics of the original data. The data distribution is gradually recovered by means of Markov chains. In the $t$-th iteration, DATG predicts the noise parameters from the current noisy data, $x_t$, using the denoising network, $f_\theta$, and generates the intermediate denoising result, $x_{t-1}$, whose conditional probability distribution can be expressed as follows:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \quad (7)

where $\mu_\theta$ and $\Sigma_\theta$ are the mean and covariance predicted by the denoising network, respectively. By minimizing the variational lower bound (ELBO) loss function, DATG gradually corrects the statistical bias of the generated data.
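A minimal sketch of this sampling step is given below; denoise_net is a placeholder for the trained network $f_\theta$, and its interface (returning a predicted mean and a diagonal variance) is an assumption made for illustration.

import torch

def reverse_denoise_step(x_t: torch.Tensor, t: int, denoise_net) -> torch.Tensor:
    """One reverse step following Equation (7): sample x_{t-1} from the predicted Gaussian."""
    mu, var = denoise_net(x_t, t)           # predicted mean and per-element variance (assumed interface)
    noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mu + torch.sqrt(var) * noise     # x_{t-1} ~ N(mu_theta, Sigma_theta)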
Step 2. 
The self-attention mechanism is introduced to calculate the attention weights as follows:
After completing denoising, the generated traffic data are fed into the self-attention mechanism. In this step, we determine the importance of each data point in the whole data sequence by calculating the similarity between them. Specifically, we compute the similarity between data points, usually through inner product operations on the query, key, and value matrices. Based on these similarities, we dynamically assign a weight to each data point, indicating the degree of influence that point has on the entire traffic sequence during the generation process.
The introduction of the self-attention mechanism after denoising aims to gradually correct local features, while the global dependency is fused by the self-attention mechanism to avoid the negative impact of noise interference on the attention weights during parallel processing. In this way, the self-attention mechanism can help the model capture complex temporal dependencies and global relationships in the data. For example, in network traffic, the appearance of certain packets may have a profound impact on the characterization of subsequent traffic, particularly factors such as delay or traffic fluctuations. In this context, the self-attention mechanism can dynamically adjust the attention weight of each data point to ensure that the model can accurately capture subtle changes and remote temporal dependencies in the traffic sequence.
Step 3. 
Data weighted summation and global dependency fusion are performed as follows:
Based on the weights calculated by the self-attention mechanism, we perform a weighted summation of each data point. The core mechanism operates by assigning higher weights to critical data points, enabling the model to focus on pivotal temporal segments while diminishing irrelevant influences. This weighted fusion integrates global dependencies so that each data point takes inter-point interactions across the sequence into account, making the generated traffic data more consistent with the global timing patterns.
The feature representation, $x'_{t-1}$, that fuses the global dependencies is generated by a weighted summation of the attention weights over the value matrix, V. This process can be formalized as follows:

x'_{t-1} = x_{t-1} + \mathrm{LayerNorm}\left(\mathrm{Attention}(Q, K, V)\right) \quad (8)

where $\mathrm{LayerNorm}$ is the layer normalization operation for stabilizing the training process. This step effectively addresses the problem of insufficient long-term dependency modeling in traditional diffusion models due to the Markov assumption by enhancing the interaction between local features and the global context.
For example, during traffic data generation, packets at critical moments contain information essential for subsequent packets. By using weighted summation, the model better captures these dependencies, enhancing both temporal and global consistency in the synthesized data.
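A small PyTorch sketch of this residual fusion (Equation (8)) is given below; it uses the built-in single-head attention module as a stand-in for the $W_Q$, $W_K$, $W_V$ projections, and the tensor shapes are assumptions.

import torch
import torch.nn as nn

d_model = 16  # assumed feature width of the denoised representation
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)
norm = nn.LayerNorm(d_model)

def fuse_global_dependencies(x_tm1: torch.Tensor) -> torch.Tensor:
    """Residual fusion of self-attention output into the denoised features, as in Equation (8)."""
    attn_out, _ = attn(x_tm1, x_tm1, x_tm1)   # Q = K = V = x_{t-1}
    return x_tm1 + norm(attn_out)

x_tm1 = torch.randn(8, 32, d_model)           # (batch, timesteps, features), shapes assumed
x_fused = fuse_global_dependencies(x_tm1)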
Step 4. 
The iterative optimization and generation process is as follows:
The fused feature, $x'_{t-1}$, is used as the input for the next denoising step, and steps 1 to 3 are repeated until the preset number of iterations is completed. The diffusion model gradually refines the local details of the generated data (e.g., packet size, timestamp accuracy), while the self-attention mechanism continuously optimizes the global timing logic. The combination of the diffusion model and the self-attention mechanism ensures that each iteration not only fine-tunes the local features but also enhances the global dependencies. Each round of denoising and weighted summation leads to continuous improvements in the generated data in terms of temporal consistency and global features.
Through multiple iterations, the model can gradually eliminate noise and strengthen the temporal dependencies at the same time. Eventually, after sufficient iterations, the generated traffic data will be of high quality, not only accurately reflecting the temporal fluctuations of network traffic but also maintaining consistency on a global scale, thus providing credible traffic samples for subsequent network analysis and testing.
The denoising result of each DATG step is corrected for global temporal dependence through the self-attention mechanism; the specific interaction is shown in Algorithm 1. After the diffusion model completes denoising in step 7, the intermediate feature, $x_{t-1}$, which follows the Gaussian distribution $\mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$, is generated. Subsequently, $x_{t-1}$ is fed into the self-attention module, where learnable weight matrices compute the query, key, and value matrices. Global dependency modeling is performed by calculating the attention weight matrix via dot product, scaling, and Softmax normalization. Finally, the attention output is fused with $x_{t-1}$ through a residual connection and layer normalization, stabilizing training while preserving both local denoised features and global temporal dependencies.
Algorithm 1: DATG Generation Process
Input: Initial noise data $x_T$, total diffusion steps $T$, denoising network $f_\theta$, self-attention parameters $W_Q, W_K, W_V$
Output: Generated traffic data $x_0$
1: Initialization: Generate noise data $x_T \sim \mathcal{N}(0, I)$ from a Gaussian distribution
2: for $t = T$ down to 1 do:
3:   a. Denoising step:
4:      Predict the mean and covariance via the denoising network:
5:      $\mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t) = f_\theta(x_t, t)$
6:      Sample the denoised result:
7:      $x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$
8:   b. Self-attention mechanism:
9:      Compute the query, key, and value matrices:
10:    $Q = x_{t-1} W_Q,\ K = x_{t-1} W_K,\ V = x_{t-1} W_V$
11:    Compute the attention matrix:
12:    $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
13:    Layer normalization and residual connection:
14:    $x'_{t-1} = x_{t-1} + \mathrm{LayerNorm}(\mathrm{Attention}(Q, K, V))$
15:  c. Update input:
16:    $x_{t-1} \leftarrow x'_{t-1}$
17: return $x_0$
The training of DATG follows the standard tiny-diffusion framework (tiny-diffusion: https://github.com/tanelp/tiny-diffusion (accessed on 16 April 2025)) with two key enhancements: global dependency modeling via self-attention and residual fusion of attention features into the denoising network. Tiny-diffusion is a lightweight implementation of the diffusion model, designed to offer a simplified model structure and code logic, providing an easy-to-understand, reproducible, and extensible diffusion model benchmark. We selected tiny-diffusion as the baseline model because it strictly follows the classic DDPM framework and offers a transparent, reproducible implementation. Compared to more complex improved models, tiny-diffusion strips away additional optimization strategies, enabling a clear comparison of the core contribution of the self-attention mechanism in DATG. As detailed in Algorithm 2, the model is trained to predict the noise added during the forward diffusion process, while the self-attention mechanism dynamically adjusts the temporal dependencies at each step.
Algorithm 2: DATG Training Process
Input: Real traffic data $x_0$, total diffusion steps $T$, denoising network $f_\theta$, self-attention parameters $W_Q, W_K, W_V$, noise schedule $\{\beta_t\}$ (linear from $1 \times 10^{-4}$ to $0.02$)
Output: Trained model parameters $\theta, W_Q, W_K, W_V$
1: Initialization: Initialize $\theta$, $W_Q$, $W_K$, and $W_V$ with Xavier; set the noise schedule $\beta_t$
2: for each training batch do:
3:   a. Forward Diffusion:
4:      Sample $x_0$ from the real data
5:      Randomly select $t \sim \mathrm{Uniform}(1, T)$
6:      Compute $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$
7:      Add noise: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$
8:   b. Reverse Denoising with Self-Attention:
9:      Predict noise: $\epsilon_\theta = f_\theta(x_t, t)$
10:    Compute the self-attention matrices:
11:    $Q = x_t W_Q,\ K = x_t W_K,\ V = x_t W_V$
12:    Compute attention:
13:    $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
14:    Fuse features: $\epsilon_\theta = \epsilon_\theta + \mathrm{LayerNorm}(\mathrm{Attention}(Q, K, V))$
15:  c. Loss Calculation:
16:    Compute the MSE loss: $\mathcal{L} = \lVert \epsilon - \epsilon_\theta \rVert^2$
17:  d. Parameter Update:
18:    Update $\theta, W_Q, W_K, W_V$ via Adam ($lr = 1 \times 10^{-4}$)
19: end for
In the DATG framework, the self-attention mechanism is embedded in the denoising process of the diffusion model because feature reconstruction during denoising requires the dynamic capture of global temporal dependencies while progressively recovering data structures. As the diffusion model incrementally reconstructs data from noise via Markov chains, the self-attention mechanism calculates temporal association weights after each denoising step (e.g., Algorithm 1, steps 8–14) and injects global information into local features through residual connections. Joint gradient optimization drives the parallel updating of the denoising network parameters and the self-attention module parameters.
The Markov assumption in vanilla diffusion models constrains long-range dependency modeling. By integrating self-attention after each denoising step, DATG overcomes this limitation, enabling global context-aware generation adjustments.
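To make the training procedure of Algorithm 2 concrete, the sketch below implements one training iteration in PyTorch: forward noising with the linear β schedule, noise prediction, residual attention fusion, and an MSE step. The network architecture, feature dimensions, and synthetic batch are simplified placeholders under the stated assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear schedule from Algorithm 2
alphas_bar = torch.cumprod(1.0 - betas, dim=0)      # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)
d_model = 64                                        # assumed hidden width

class DenoiseNet(nn.Module):
    """Placeholder denoising network: an MLP with a single self-attention block."""
    def __init__(self, num_features: int):
        super().__init__()
        self.inp = nn.Linear(num_features + 1, d_model)  # +1 channel for the timestep
        self.hidden = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model), nn.ReLU())
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, num_features)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, seq_len, num_features); t: (batch,) diffusion timesteps
        t_feat = (t.float() / T).view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        h = self.hidden(self.inp(torch.cat([x_t, t_feat], dim=-1)))
        attn_out, _ = self.attn(h, h, h)       # global temporal dependencies
        h = h + self.norm(attn_out)            # residual fusion (Algorithm 2, step b)
        return self.out(h)                     # predicted noise epsilon_theta

def training_step(model: DenoiseNet, x0: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One iteration of Algorithm 2: noise a clean batch, predict the noise, take an MSE step."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # forward diffusion
    loss = F.mse_loss(model(x_t, t), eps)                          # L = ||eps - eps_theta||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with synthetic, normalized traffic sequences (not real data).
model = DenoiseNet(num_features=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x0 = torch.randn(16, 32, 3)
print(training_step(model, x0, opt))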

5. Experiment

5.1. Dataset

The dataset used in this experiment is ISCXVPN2016 [27], which collects a large amount of network traffic data by simulating a variety of network activities, including but not limited to web browsing, file transfer, video streaming, and email, in a controlled environment over multiple hours. Its common uses include traffic classification, feature extraction, malicious traffic detection, and performance evaluation of intrusion detection systems. In addition to ISCXVPN2016, this paper introduces the publicly available UNSW-NB15 dataset [28], which simulates the traffic of modern web services and is used to test the model’s adaptability to new protocols.

5.2. Experimental Environment and Parameter Settings

The experimental environment in this paper is based on Python 3.9.18 and the PyTorch 2.0.0 deep learning framework. The experimental platform includes an NVIDIA GeForce RTX 4090 graphics card, an Intel Core i9-13900 processor, 128 GB of memory, and the Windows Server 2022 Standard operating system. The selected experimental parameters are summarized in Table 1.
In our experiments, we comprehensively evaluate the DATG model’s performance by implementing the following comparative methods:
GAN: As a traditional deep generative model, GAN has a wide range of applications in the field of traffic generation. It generates data through the adversarial training of generators and discriminators and is capable of capturing complex traffic features. By selecting GAN as a comparison object, we conduct evaluations on the enhancement of DATG in terms of the authenticity and diversity of generated traffic compared to traditional methods.
DM: The diffusion model generates data by simulating the forward noise diffusion process and the reverse denoising process. A single diffusion model is chosen as the subject of the ablation experiment to evaluate the improvement of the self-attention mechanism on the performance of the diffusion model.
DATG (the method in this paper): The self-attention mechanism can effectively capture long-range dependencies and contextual connections in sequence data. Combining the self-attention mechanism with the diffusion model can further improve the quality and diversity of the generated traffic.
We conducted comparative experiments at different diffusion steps (T = 500, 1000). The results show that the JSD and CRPS metrics are optimally balanced at T = 1000 steps, and 1000 steps achieve the best trade-off between generation quality and training efficiency.
We tested the effect of different numbers of layers (2, 4, 6) on the quality of the generation. The four-layer model optimizes the balance between the number of parameters and the performance. The six-layer model shows a limited performance improvement (only a 0.01 reduction in JSD), but the number of parameters is increased by 50%, and the training time is extended by 30%.
We compare the performance of different neuron numbers on the ISCXVPN2016 dataset. With a neuron selection of 64, the JSD and CRPS are reduced by 25% and 23.5%, respectively, compared to 32 neurons. The training time is only 61.4% of that for 128 neurons, while the difference in JSD is only 0.02.
Linear scheduling is validated in the diffusion model with a noise rate, $\beta_t$, that increases linearly over time, ensuring smooth degradation of the forward process. DATG and DM employ identical linear scheduling to exclude noise strategy interference from the experimental results. The GAN parameters are configured symmetrically to the DATG architecture, ensuring comparable computational complexity. When the GAN is expanded to six layers, its average JSD of 0.23 remains significantly inferior to that of DATG, proving that the performance advantage originates from the self-attention mechanism rather than parameter scale.

5.3. Experimental Result

To verify the model’s robustness, we designed an experiment comparing evaluation metrics between traffic generated with and without added noise to the original data. In this experiment, traffic generation was repeated after introducing random Gaussian noise to the original data while keeping other conditions unchanged. Subsequently, the generation results under noise-added/noise-free conditions were compared by assessing robustness performance through the evaluation metrics.
This robustness provides advantages for the model in practical applications. During network traffic collection, data inevitably suffer from transmission noise or device errors, yet DATG maintains high-quality generation capability under such conditions. This ensures reliable deployment in complex network environments while enhancing the technique’s adaptability. Future work will explore model performance under higher noise conditions through structural optimizations targeting robustness enhancement. The evaluation results of the three models are summarized in Table 2.
On the NVIDIA GeForce RTX 4090 graphics card, DATG takes about 12 h for a single training session, but the efficiency can be optimized through parallel computing for real-world applications due to its significantly improved generation quality. Future work will explore lightweight model structures to reduce resource consumption.
By comparing these three methods, DATG is comprehensively evaluated for its performance enhancement in terms of authenticity and temporal consistency of generated traffic, as well as the enhancement effect of the self-attention mechanism on the diffusion model. The experimental results show that DATG outperforms GAN and a single diffusion model in terms of both JSD and CRPS metrics, proving the effectiveness and superiority of the fusion of the diffusion model and the self-attention mechanism.
JSD, as a core metric for measuring the similarity between the distribution of generated traffic and real traffic, verifies the advantage of the DATG model. On the ISCXVPN2016 dataset, the traditional GAN model often shows significant deviation in the distribution of protocol types in the generated traffic due to the mode collapse problem (e.g., HTTPS accounts for 42% in the real data but only 28% in the GAN-generated data), resulting in a JSD value as high as 0.31. Through the synergistic optimization of the diffusion model and the self-attention mechanism, DATG reduces the JSD value to 0.18, a 41.9% improvement over the GAN.
The CRPS metric quantifies the fidelity of the generated traffic in terms of its dynamic timing characteristics, and the DATG model achieves a substantial improvement in this dimension. Traditional GAN-generated traffic is severely distorted in bursty scenarios (e.g., the initial phase of an FTP large-file transfer), resulting in a CRPS value of 0.19, whereas DATG establishes cross-step dependencies through the self-attention mechanism in the time dimension, reducing the CRPS value to 0.13, which is 31.6% lower than that of the GAN.
As shown in Figure 4, DATG achieves the lowest final training loss (0.09) and validation loss (0.11) among all models, demonstrating its superior convergence efficiency. In contrast, DM exhibits slower optimization due to the lack of global dependency modeling, with a final validation loss of 0.15. GAN, while capturing certain traffic features, suffers from adversarial instability, resulting in the highest validation loss (0.29) and significant oscillations (±0.15). These results quantitatively validate that the fusion of diffusion and self-attention mechanisms in DATG effectively balances generation diversity and temporal consistency.
Figure 4 shows the variation of training loss and validation loss during the DATG training process. Through observation, it can be seen that the training loss and validation loss exhibit significant decreases in the first few training cycles, which indicates that the model is able to learn effective features quickly at the initial stage. Overall, the model shows good stability and reliability, especially in terms of generation stability and temporal consistency, which further validates the effectiveness of the traffic generation framework based on the fusion of the diffusion model and self-attention mechanism proposed in this paper for improving data quality and diversity.

6. Conclusions

We propose a traffic generation framework (DATG) based on the fusion of a diffusion model with a self-attention mechanism, which effectively addresses the deficiencies of traditional traffic generation methods in terms of dynamics, diversity, and global dependency modeling. Through the progressive denoising process of the diffusion model and the global temporal dependency capture of the self-attention mechanism, DATG generates high-quality and diverse traffic data across multiple scenarios. Experimental results demonstrate that DATG offers significant advantages in generation stability, timing consistency, and data diversity.
Although our proposed model achieves better results in generating network traffic, several challenges remain, and there is room for improvement. Future research can explore applying DATG to more public datasets and assess its generalization ability in different network environments. Current research mainly focuses on generating normal traffic; future work can extend the model to generate malicious traffic, providing intrusion detection systems with more diverse training data. As the complexity of the generation task increases, so does the model’s training time and demand for computational resources. Future improvements can optimize the model structure or use more efficient training methods to enhance training efficiency and generation quality. Applying the diffusion model and self-attention mechanism in real-time network traffic generation will offer more timely and reliable support for network performance testing and security protection. Future efforts will focus on further optimizing the model structure, exploring additional application scenarios, and applying DATG in real network environments to strengthen technical support for network testing and security protection.

Author Contributions

Conceptualization, Z.W. and M.Q.; methodology, Z.W.; validation, X.L., Z.G., X.S. and J.L.; formal analysis, Z.W. and Z.G.; data curation, Z.W., X.L., Z.G. and X.S.; writing—original draft, Z.W. and M.Q.; writing—review and editing, Z.W., Z.G., X.S. and J.L.; supervision, X.S. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Project of the Translational Application Project of the “Wise Eyes Action” (Project No. F2B6A194). We would like to express our deepest gratitude to these organizations for their generous funding and support.

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  2. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  3. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  5. Xiao, C.; Li, B.; Zhu, J.-Y.; He, W.; Liu, M.; Song, D. Generating Adversarial Examples with Adversarial Networks. arXiv 2019, arXiv:1801.02610. [Google Scholar]
  6. Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting Adversarial Attacks with Momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9185–9193. [Google Scholar]
  7. Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:1803.02999. [Google Scholar]
  8. Ring, M.; Schlör, D.; Landes, D.; Hotho, A. Flow-Based Network Traffic Generation Using Generative Adversarial Networks. Comput. Secur. 2019, 82, 156–172. [Google Scholar] [CrossRef]
  9. Wu, D.; Fang, B.; Wang, J.; Liu, Q.; Cui, X. Evading Machine Learning Botnet Detection Models via Deep Reinforcement Learning. In Proceedings of the ICC 2019–2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
  10. Dowoo, B.; Jung, Y.; Choi, C. PcapGAN: Packet Capture File Generator by Style-Based Generative Adversarial Networks. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1149–1154. [Google Scholar]
  11. Fathi-Kazerooni, S.; Rojas-Cessa, R. GAN Tunnel: Network Traffic Steganography by Using GANs to Counter Internet Traffic Classifiers. IEEE Access 2020, 8, 125345–125359. [Google Scholar] [CrossRef]
  12. Du, L.; He, J.; Li, T.; Wang, Y.; Lan, X.; Huang, Y. DBWE-Corbat: Background Network Traffic Generation Using Dynamic Word Embedding and Contrastive Learning for Cyber Range. Comput. Secur. 2023, 129, 103202. [Google Scholar] [CrossRef]
  13. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models. Neurocomputing 2022, 479, 47–59. [Google Scholar] [CrossRef]
  14. Sasaki, H.; Willcocks, C.G.; Breckon, T.P. UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models. arXiv 2021, arXiv:2104.05358. [Google Scholar]
  15. Wang, W.; Bao, J.; Zhou, W.; Chen, D.; Chen, D.; Yuan, L.; Li, H. Semantic Image Synthesis via Diffusion Models. arXiv 2022, arXiv:2207.00050. [Google Scholar]
  16. Li, X.; Thickstun, J.; Gulrajani, I.; Liang, P.S.; Hashimoto, T.B. Diffusion-LM Improves Controllable Text Generation. Adv. Neural Inf. Process. Syst. 2022, 35, 4328–4343. [Google Scholar]
  17. Gong, S.; Li, M.; Feng, J.; Wu, Z.; Kong, L. DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. arXiv 2023, arXiv:2210.08933. [Google Scholar]
  18. Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional Score-Based Diffusion Models for Probabilistic Time Series Imputation. Adv. Neural Inf. Process. Syst. 2021, 34, 24804–24816. [Google Scholar]
  19. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  20. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  21. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
  22. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Shen, C. Conditional Positional Encodings for Vision Transformers. arXiv 2023, arXiv:2102.10882. [Google Scholar]
  23. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  24. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying Convolution and Attention for All Data Sizes. Adv. Neural Inf. Process. Syst. 2021, 34, 3965–3977. [Google Scholar]
  25. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  26. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  27. Gil, G.D.; Lashkari, A.H.; Mamun, M.; Ghorbani, A.A. Characterization of Encrypted and VPN Traffic Using Time-Related Features. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), Rome, Italy, 19–21 February 2016; SciTePress: Setúbal, Portugal, 2016; pp. 407–414. [Google Scholar]
  28. Moustafa, N.; Slay, J. UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 Network Data Set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6. [Google Scholar]
Figure 1. Self-attention mechanism workflow.
Figure 2. DATG workflow, divided into three modules: the data preprocessing module (left); the combined diffusion–self-attention module (middle), in which the diffusion processing module (top) and the self-attention module (bottom) exchange information and share gradients; and the iterative optimization area (bottom right), which stores and processes the data produced by the combined module.
Figure 3. Framework diagram of the diffusion self-attention mechanism. The denoising process of the diffusion model is combined with the self-attention mechanism, which computes association weights between individual data points after each denoising step.
Figure 4. Training and validation loss dynamics of DATG, DM, and GAN models: (a) DATG training loss curve; (b) DM training loss curve; (c) GAN training loss curve. Solid lines denote training loss, and dashed lines represent validation loss.
Table 1. Parameter selection.

| Parameter         | DATG                  | DM                    | GAN |
|-------------------|-----------------------|-----------------------|-----|
| Diffusion Steps   | 1000                  | 1000                  | —   |
| Model Layers      | 4                     | 4                     | 4   |
| Neurons per Layer | 64                    | 64                    | 64  |
| Noise Scheduling  | β_t: 1 × 10⁻⁴ to 0.02 | β_t: 1 × 10⁻⁴ to 0.02 | —   |
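For readers who want to reproduce the forward process implied by Table 1, the snippet below shows a standard DDPM-style linear noise schedule with 1000 steps and β ranging from 1 × 10⁻⁴ to 0.02, following Ho et al. [3]. It is an illustrative sketch only; the variable names and the q_sample helper are editorial and are not taken from the paper's code.

```python
import torch

# Linear beta schedule matching Table 1: 1000 diffusion steps, beta from 1e-4 to 0.02.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products, i.e. alpha-bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Forward diffusion: produce the noised sample x_t from clean features x0 at step t."""
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over feature dims
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```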
Table 2. Comparative results.

| Model | JSD (IV) | JSD (UN) | CRPS (IV) | CRPS (UN) |
|-------|----------|----------|-----------|-----------|
| GAN   | 0.31     | 0.38     | 0.19      | 0.24      |
| DM    | 0.24     | 0.29     | 0.17      | 0.21      |
| DATG  | 0.18     | 0.22     | 0.13      | 0.16      |

Note. The ISCX-VPN2016 dataset is abbreviated as IV and the UNSW-NB15 dataset as UN.
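The two metrics in Table 2 can be estimated in several ways; the sketch below shows one common empirical convention (histogram-based JSD and ensemble CRPS). The helper names jsd and crps, the bin count, and the histogram-based estimation are editorial assumptions and may differ from the paper's exact evaluation procedure.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(real: np.ndarray, generated: np.ndarray, bins: int = 50) -> float:
    """Histogram-based Jensen-Shannon divergence between real and generated values."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi))
    # SciPy returns the JS *distance* (square root of the divergence), so square it.
    return jensenshannon(p, q) ** 2

def crps(ensemble: np.ndarray, observation: float) -> float:
    """Empirical CRPS of an ensemble of generated values against one observed value."""
    ensemble = np.asarray(ensemble, dtype=float)
    term1 = np.abs(ensemble - observation).mean()
    term2 = 0.5 * np.abs(ensemble[:, None] - ensemble[None, :]).mean()
    return term1 - term2
```

Lower values of both metrics indicate that the generated traffic distribution is closer to the real one, which is the sense in which DATG improves on DM and GAN in Table 2.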