Abstract
With the widespread use of encryption protocols on network data, fast and effective encrypted traffic classification can improve the efficiency of traffic analysis. A resampling method combining Wasserstein GAN and random selection is proposed to solve the dataset imbalance problem: it uses Wasserstein GAN for oversampling and random selection for undersampling to achieve class balance. Based on Mamba, a model with an ultra-low parameter count, we propose an encrypted traffic classification model, ET-Mamba, which has a pre-training phase and a fine-tuning phase. During the pre-training phase, positional embedding is used to characterize the blocks of the traffic grayscale image, and random masking is used to strengthen the learning of the intrinsic correlations among those blocks. During the fine-tuning phase, the agent attention mechanism is adopted in feature extraction to achieve global information modeling at a low computational cost, and the SmoothLoss function is designed to solve the insufficient generalization ability of the cross-entropy loss function during training. The experimental results show that the proposed model significantly reduces the number of parameters and outperforms other models in terms of classification accuracy on non-VPN datasets.
1. Introduction
In this digital era, network traffic is growing rapidly and becoming an important resource for network security research. With the improvement in privacy protection awareness and the wide application of encryption technology, the proportion of encrypted traffic is increasing, which increases the difficulty of traffic monitoring and identification. Encrypted traffic classification can determine the traffic category, which is conducive to the further monitoring and identification of encrypted traffic.
We propose an encrypted traffic classification model, Encrypted Traffic Mamba (ET-Mamba), based on the ultra-low-parameter Mamba model [1]. The main innovations of this model are as follows:
(1) A resampling method combining Wasserstein GAN [2] and random selection is designed to compose a new dataset with more balanced categories. By combining generated minority-class samples with a randomly selected portion of the majority-class samples, the resampling method avoids the problem that the classifier tends to learn categories with a larger number of samples while ignoring categories with a smaller number of samples.
(2) A framework for encrypted traffic classification is proposed based on the encoder and decoder of Mamba. The framework comprises a pre-training stage built around the Mamba encoder and a fine-tuning stage built around the Mamba decoder.
(3) During the pre-training phase, positional embedding of the blocks of the traffic grayscale image is used for feature representation, and random masking is used to reinforce the learning of contextual associations. During the fine-tuning phase, the agent attention mechanism is adopted for feature extraction to achieve global information modeling at a low computational cost.
(4) The SmoothLoss function is designed to reduce the sensitivity of the model to noise by introducing label smoothing combined with L2 regularization to suppress parameter overfitting and to improve the generalization ability and stability of the model.
The structure of the remainder of the paper is as follows: Section 2 provides an overview of the latest approaches for encrypted traffic classification. Section 3 presents our method and module design. Section 4 describes our comparative and ablation experiments. Section 5 summarizes the study, its limitations, and future work.
2. Related Work
In the past two decades, port-based classification [3], deep packet inspection [4], and machine learning methods [5] have made significant progress. However, due to the ever-growing complexity of the Internet, new encryption protocols, and advanced defense strategies, they now face reduced efficiency. Traditional models have been unable to deal effectively with the encrypted traffic classification problem. Thus, some researchers have begun to use deep learning to classify encrypted traffic.
Deep learning approaches have increasingly replaced traditional machine learning for encrypted traffic classification. Wang et al. [6] were the first to apply deep learning to encrypted traffic classification; the first 784 bytes of each original packet were used as model inputs, a one-dimensional convolutional neural network was used for feature extraction, and classification accuracy was significantly improved over traditional machine learning methods. A recurrent neural network (RNN) was chosen to construct the model in [7] because of its advantage in sequence modeling; it can extract latent temporal features from encrypted traffic and improve classification performance. Building on the analysis of why convolutional neural networks can handle the encrypted traffic classification problem, long short-term memory (LSTM) networks also achieved a good classification effect [8], but at greater model depth. A graph neural network (GNN) model has been proposed [9] that combines the raw bytes of a packet with the relationships between packets, taking full advantage of the non-Euclidean properties and temporal relationships of network data to improve classification accuracy. Combination models have also been proposed, such as CNN combined with RNN [10] and CNN combined with Transformer [11]. Among large models, ET-BERT [12] is a BERT-based encrypted traffic classifier that pre-trains deep contextualized datagram-level traffic representations from large-scale unlabeled data and obtains good classification results with simple fine-tuning on labeled data, but it has a large number of parameters. NetMamba [13], an efficient linear-time SSM proposed by Wang et al., is a specially selected and improved unidirectional Mamba structure. Nevertheless, despite promising outcomes, the Mamba model remains at a nascent stage of development, and widespread adoption and large-scale applied research have yet to materialize. In addition, Wang et al. did not consider the impact of imbalanced data. Class imbalance is common in encrypted traffic classification. An ensemble learning approach has been used to deal with it [14], combining the predictions of multiple classifiers to improve the overall classification effect. Another treatment combines cross-entropy loss with FocalLoss to widen the loss gap between categories by adjusting weights so that the model focuses more on minority-class samples [15]. An Imbalanced Generative Adversarial Network (IGAN) has been used to generate minority-class samples to solve the class imbalance problem [16].
Most current deep learning-based encrypted traffic identification methods achieve identification and classification by increasing network depth or fusing multiple neural network models; although this can achieve good results, most of these models are structurally complex, have many parameters, and require huge computational overhead. Existing methods for resolving class imbalance in encrypted traffic datasets encounter two main problems: one is assigning higher weights to minority-class samples through ensemble learning or loss functions, which is likely to result in high model complexity or a risk of overfitting; the other is using generative adversarial networks to generate minority-class samples to balance the dataset without pruning the majority-class samples, and the generated samples are of varying quality.
3. Methods
3.1. Balancing in Data Processing
The USTC-TFC2016 [17], ISCX-VPN2016 [18], and ISCX-Tor2016 [19] public datasets are used in our study of encrypted traffic classification. To convert the raw traffic data into an input format that the deep learning model can accept, each data packet is transformed into a 28 × 28-pixel grayscale image [20]. Firstly, based on the five-tuple information (source IP address, source port number, destination IP address, destination port number, and protocol), SplitCap is used to split the pcap files in the original dataset by session. Secondly, the split traffic contains many duplicate files whose hash values are identical; FindDuplicate v1.04 is used to calculate the hash values of these files, and duplicates with the same hash are deleted. Thirdly, the UDP header is fixed at 8 bytes while the TCP header is fixed at 20 bytes, so twelve 0 bytes are appended to the UDP header. Considering that most data packets are shorter than 784 bytes, the first 784 bytes of each packet are converted into a grayscale image: packets shorter than 784 bytes are zero-padded to 784 bytes, and longer packets are truncated to 784 bytes. Finally, the MAC address of the data link layer and the IP address of the IP layer are replaced with zeros.
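The byte-to-image conversion can be summarized in a few lines. Below is a minimal Python sketch of this preprocessing step; the function name and example payload are illustrative, not taken from the paper's code.

```python
import numpy as np

def packet_to_grayscale(payload: bytes, length: int = 784) -> np.ndarray:
    """Pad or truncate raw session bytes to `length` and reshape them into
    a 28x28 grayscale image with pixel values in [0, 255]."""
    buf = payload[:length].ljust(length, b"\x00")  # truncate long, zero-pad short
    return np.frombuffer(buf, dtype=np.uint8).reshape(28, 28)

img = packet_to_grayscale(b"\x16\x03\x01" * 10)  # e.g., a short TLS-like payload
print(img.shape)  # (28, 28)
```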
All three datasets are class-imbalanced, as seen in Table 1, Table 2 and Table 3. In the USTC-TFC2016 dataset, the FaceTime file size is 2 MB, while the Weibo file size is 1.6 GB. In the ISCX-VPN2016 dataset, the email file size is 21 MB, while the file-transfer file size is 17.6 GB. In the ISCX-Tor dataset, the email file size is 16 MB, while the file-transfer file size is 13 GB. This significant imbalance among sample categories causes the model to learn the categories with more samples and ignore those with fewer samples during training, which reduces the generalization ability and classification accuracy of the model.
Table 1.
Sample file sizes in the ISCX-VPN dataset.
Table 2.
Sample file sizes in the ISCX-Tor dataset.
Table 3.
Sample file sizes in the USTC-TFC dataset.
This study proposes a resampling method combining Wasserstein GAN and random selection for resolving the sample category imbalance in a dataset (Figure 1), and this method includes an oversampling process based on the Wasserstein GAN network and an undersampling process based on the random selection method. In the oversampling process, the minority class samples are fed into the Wasserstein GAN, and the virtual samples of the corresponding classes are generated, thus expanding the number of minority class samples; in the undersampling process, a portion of the samples are randomly selected from the majority class samples. After resampling, the number of samples in each category is 10,000, and the samples in each category are divided according to training/validation/testing = 8:1:1.
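The following Python sketch outlines this resampling logic, assuming a Wasserstein GAN generator has already been trained per minority class (generator training is omitted); the names `resample_class` and `generator` and the latent dimension `z_dim = 100` are illustrative.

```python
import random
import torch

TARGET = 10_000  # samples per class after resampling, as stated above

def resample_class(samples: list, generator: torch.nn.Module, z_dim: int = 100) -> list:
    """Balance one class to TARGET samples: randomly undersample a majority
    class, or top up a minority class with WGAN-generated virtual samples."""
    if len(samples) >= TARGET:
        return random.sample(samples, TARGET)      # undersampling: random selection
    n_new = TARGET - len(samples)
    with torch.no_grad():
        noise = torch.randn(n_new, z_dim)          # latent noise fed to the generator
        fake = generator(noise)                    # e.g., (n_new, 1, 28, 28) images
    return samples + list(fake)                    # oversampling: append virtual samples
```

After all classes are balanced, each class is split 8:1:1 into training, validation, and test sets as described above.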
Figure 1.
Resampling flowchart for Wasserstein GAN combined with random selection.
3.2. ET-Mamba Model
3.2.1. Model Design
The overall framework of the encrypted traffic classification model ET-Mamba proposed in this study, based on the Mamba architecture, is shown in Figure 2, comprising a pre-training phase and a fine-tuning phase. In the pre-training phase, the model uses positional encoding for raw feature representation and a Mamba encoder for feature extraction; in the fine-tuning phase, the Mamba decoder and the agent attention mechanism are used for feature extraction. During training, pre-training is first carried out to obtain the hidden space representation of the encrypted traffic; this representation is then fine-tuned to enhance the feature representation of the encrypted traffic and improve model performance. Meanwhile, the SmoothLoss function is designed to solve the insufficient generalization ability of the cross-entropy loss function during training.
Figure 2.
Framework of ET-Mamba.
The input to the model is a 28 × 28-pixel grayscale image of the encrypted traffic data [20]. The input image is sliced using a convolution with a 4 × 4 kernel and a stride of 4, dividing it into 49 image blocks that retain the local features of the original data. Positional coding helps the model understand the arrangement of the image blocks and their relative positional relationships, which is the key to capturing the spatial structural features among them. The model encodes each image block by positional embedding to ensure that the image blocks retain their spatial position information during encoding. For the model to accurately understand the correlations between image blocks, a random masking approach is introduced in the pre-training phase, which randomly masks a portion of the image blocks and trains the model to predict them. Random masking forces the model to focus on the local features of the image blocks and the global relationships between them, enabling effective recognition despite incomplete or missing data. After positional embedding and random masking, the grayscale image block of encrypted traffic can be represented as Equation (1):

$$z_i = (x_i + p_i) \odot m_i \tag{1}$$
In the above equation, $x_i$ is the $i$-th image block, $p_i$ is the positional embedding representation of the $i$-th image block, which typically uses sequential positional encoding, and $m_i$ is a randomly generated mask used to mask out parts of the image block.
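A minimal PyTorch sketch of this patch embedding step follows; the embedding width of 64 is an illustrative assumption, while the 4 × 4 stride-4 convolution and the 49-patch layout follow the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Slice a 28x28 grayscale image into 49 patch tokens (Equation (1))."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=4, stride=4)  # 28x28 -> 7x7 patches
        self.pos = nn.Parameter(torch.zeros(1, 49, dim))        # positional embedding p_i

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, 28, 28); mask: (B, 49), 1 = visible, 0 = masked
        z = self.proj(x).flatten(2).transpose(1, 2)  # (B, 49, dim) patch tokens x_i
        z = z + self.pos                             # add positional embedding
        return z * mask.unsqueeze(-1)                # zero out masked patches (m_i)

emb = PatchEmbed()
z = emb(torch.randn(8, 1, 28, 28), torch.ones(8, 49))
print(z.shape)  # torch.Size([8, 49, 64])
```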
The representation $z$ is encoded using a Mamba encoder to obtain the hidden space representation $h$ according to Equation (2):

$$h = \mathrm{Encoder}(z; \theta_e) \tag{2}$$
In the above equation, $\mathrm{Encoder}(\cdot)$ is the mapping function of the Mamba encoder, which generates the latent space representation $h$ based on the input $z$ and the encoder parameters $\theta_e$. The core encoding part is composed of multiple Mamba blocks. Each block integrates RMSNorm normalization to improve numerical stability, residual connections to alleviate the difficulty of training deep networks, and a DropPath stochastic-depth mechanism to enhance generalization. Together, these ensure the extraction of deep features from $z$ and finally generate a meaningful hidden space representation $h$, achieving an efficient mapping from input to latent features.
The goal of the fine-tuning phase is to utilize the latent space representation learned during the pre-training phase, that is, the reconstruction block $h$, to further optimize the model for the specific downstream task of traffic classification. During fine-tuning, the reconstruction block $h$ is fed into the Mamba decoder, which converts it into the feature vector $F$, as shown in Equation (3):

$$F = \mathrm{Decoder}(h; \theta_d) \tag{3}$$
where $\mathrm{Decoder}(\cdot)$ is the mapping function of the decoder, which generates the output $F$ based on the reconstruction block $h$ and the decoder parameters $\theta_d$. The parameters $\theta_d$ include the learnable parameters of each core component of the decoder: the weights and biases of the linear layer responsible for mapping the encoder output to the decoder's target dimension, the weights and biases of the SSM layer and the feedforward network, and the normalization parameters that normalize the feature distribution.
Further feature extraction on $F$ using the agent attention mechanism allows the model to dynamically focus on the parts of $F$ that are relevant to a particular task and suppress irrelevant information, extracting a more accurate feature representation $F'$, as in Equation (4):

$$F' = \sum_{i=1}^{n} \alpha_i F_i \tag{4}$$
where $\alpha_i$ is calculated from a specific query vector and measures the importance of each information unit in the decoder output to the task at hand, $F_i$ is the $i$-th element of the decoder output $F$, and $n$ represents the number of feature items involved in the attention calculation.
The fully connected layer of the model maps the obtained agent-attention-weighted feature representation $F'$ to the probability distribution $P$ over the categories, as in Equation (5):

$$P = \mathrm{Softmax}(W F' + b) \tag{5}$$
where $W$ is the weight matrix of the fully connected layer, $b$ is the bias term, and $P$ represents the probability distribution over the categories. Softmax is commonly used for multi-classification problems; it converts a real-valued vector into a probability distribution, that is, it maps each element of the vector to a value between 0 and 1 such that all elements sum to 1.
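As a concrete illustration of Equation (5), the sketch below applies a linear layer and Softmax to a batch of feature vectors; the feature width (64) and class count (10) are illustrative assumptions.

```python
import torch
import torch.nn as nn

head = nn.Linear(64, 10)                       # weight W and bias b of the FC layer
features = torch.randn(32, 64)                 # agent-attention-weighted features F'
probs = torch.softmax(head(features), dim=-1)  # Equation (5): per-class probabilities
print(probs.sum(dim=-1))                       # each row sums to 1
```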
3.2.2. Random Masking
The purpose of random masking is to artificially introduce information loss, preventing the model from relying on simple extrapolation from adjacent information and thereby forcing it to deeply explore the intrinsic features of encrypted traffic. After positional encoding, 1/7 of the encrypted traffic blocks are randomly masked. Here, the original encrypted traffic image is 28 × 28 pixels; after convolution with a 4 × 4 kernel and a stride of 4, it is processed into 49 encrypted traffic image blocks of 4 × 4 each, so a ratio of 1/7 corresponds to 7 blocks. The specific mask generation operation is described in Equation (6):

$$\mathcal{I} = \mathrm{Shuffle}(\{1, 2, \dots, 49\}), \quad \mathcal{V} = \{\mathcal{I}_1, \dots, \mathcal{I}_{42}\}, \quad \mathcal{M} = \{\mathcal{I}_{43}, \dots, \mathcal{I}_{49}\} \tag{6}$$

where $\mathcal{V}$ is the set of visible block indices and $\mathcal{M}$ is the set of masked block indices.
After completing the positional embedding, the Shuffle function randomly rearranges the encrypted traffic image blocks. In the actual transmission process of encrypted traffic data, there may be certain patterns or sequences. If the model directly learns this inherent order, it may overly rely on these surface patterns and fail to deeply explore the intrinsic features of the data. The Shuffle function randomly shuffles the order of data blocks, disrupting these possible simple patterns and increasing the diversity and complexity of the data. When processing data, the model needs to learn from the content, features, and complex relationships between data blocks, which promotes the model to learn more universal and robust feature representations and improves its generalization ability. After random rearrangement using the Shuffle function, it is necessary to select a set of visible encrypted traffic blocks and masked encrypted traffic blocks from the rearranged encrypted traffic image blocks. In this process, the first 42 encrypted traffic blocks are selected as the set of visible encrypted traffic blocks. At the same time, seven encrypted traffic blocks are masked with all zeros to form a completely black encrypted traffic image block. These masked, completely black encrypted traffic image blocks simulate data loss during model training, and the model needs to attempt to reconstruct the masked parts based on the information of visible blocks. This approach is similar to the situation in actual network environments where some data may be lost or unavailable. By learning how to recover complete information from partially visible data, the model can better cope with complex and changing network environments and improve its performance in tasks such as encrypted traffic classification. Moreover, by adjusting the ratio and selection method of visible blocks and mask blocks, the learning performance of the model can be further optimized, achieving a good performance balance in different datasets and task scenarios.
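The visible/masked split described above can be sketched as follows in PyTorch; for simplicity, this version draws one random permutation for the whole batch, which is an assumption rather than a detail stated in the paper.

```python
import torch

def random_mask(patches: torch.Tensor, n_visible: int = 42):
    """Shuffle the 49 patch indices, keep the first 42 visible, and zero
    out the remaining 7 as all-black blocks (Equation (6))."""
    B, N, D = patches.shape
    perm = torch.randperm(N)              # Shuffle: random patch order
    visible_idx = perm[:n_visible]        # first 42 indices -> visible set V
    masked_idx = perm[n_visible:]         # last 7 indices  -> masked set M
    out = patches.clone()
    out[:, masked_idx] = 0.0              # simulate data loss during training
    return out, visible_idx, masked_idx

masked, vis, msk = random_mask(torch.randn(8, 49, 64))
print(vis.numel(), msk.numel())  # 42 7
```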
3.2.3. Mamba Encoder and Decoder
The structure of the Mamba encoder and decoder of the model is shown in Figure 3. Its core design combines normalization (Norm), a convolution operation (Conv), and a selective state-space model (SSM) for the effective modeling of the stride tokens. The stride tokens are first normalized to ensure numerical stability. On one branch, the normalized stride tokens are transformed into the feature space through a linear layer to suit the subsequent convolution and state-space modeling; a convolution module then captures the pattern relationships between local features and their neighboring steps, an activation function introduces nonlinear expressive capability, and the SSM module models the long-range dependencies of the sequence data to further enhance global feature capture. On the other branch, the normalized stride tokens are successively processed by a linear layer and an activation function and interact with the output of the SSM via element-wise multiplication, achieving an integrated expression of the global information and original features of the stride tokens. The synthesized expression is further processed by a linear layer and fused with the original stride tokens through a residual connection, which preserves the original information of the stride tokens and improves the stability of gradient propagation. The design of the entire Mamba encoder reflects the balanced modeling of local and global features.
Figure 3.
Encoder/decoder of Mamba.
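The block structure described above can be sketched as follows in PyTorch (assuming PyTorch ≥ 2.4 for nn.RMSNorm); the selective SSM itself is represented by a placeholder module, since its selective-scan implementation is beyond the scope of this sketch, and DropPath is omitted.

```python
import torch
import torch.nn as nn

class MambaBlock(nn.Module):
    """Structural sketch of one Mamba block: Norm, gated dual branch,
    depthwise Conv, SSM, and a residual connection."""
    def __init__(self, dim: int, ssm: nn.Module | None = None):
        super().__init__()
        self.norm = nn.RMSNorm(dim)                    # numerical stability
        self.in_proj = nn.Linear(dim, 2 * dim)         # split into main + gate branches
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # local patterns
        self.act = nn.SiLU()
        self.ssm = ssm or nn.Identity()                # placeholder for the selective SSM
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) sequence of stride tokens
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        h = self.act(self.conv(h.transpose(1, 2)).transpose(1, 2))  # conv over sequence dim
        h = self.ssm(h)                                # long-range dependence modeling
        h = h * self.act(gate)                         # element-wise gating branch
        return x + self.out_proj(h)                    # residual connection

out = MambaBlock(64)(torch.randn(2, 49, 64))
print(out.shape)  # torch.Size([2, 49, 64])
```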
3.2.4. Agent Attention Module
Softmax attention [21] possesses strong expressive power and can effectively capture dependencies in sequence data, but its computational cost rises significantly as the input sequence length increases. Linear attention [22] reduces the computational complexity by multiplying the input sequences with learnable weight matrices, but it usually relies on fixed-size weight matrices or inner-product operations and cannot handle variable-length sequences. The agent attention mechanism [23] combines the advantages of Softmax attention and linear attention. First, preliminary information extraction is carried out by linear attention, quickly obtaining preliminary representations of the input sequences without increasing the computational complexity; then, these preliminary representations are weighted and averaged using Softmax attention, so that input sequences of arbitrary length can be handled flexibly. We use the agent attention mechanism in the feature extraction process of the fine-tuning stage to achieve a better balance between performance and computational efficiency.
As shown in Figure 4, the agent attention mechanism introduces the concept of an "agent". A linear transformation is applied to the input feature map $X$ using Equation (7), resulting in the query $Q$, key $K$, and value $V$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V \tag{7}$$

Further, the agent token $A$ is calculated using Equation (8) by pooling the query [23]:

$$A = \mathrm{Pooling}(Q) \tag{8}$$
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices, and $d$ is the feature dimension of the elements.
Figure 4.
Agent attention mechanism.
In the information aggregation process, first, the similarity matrix $S$ between the agent token $A$ and the key $K$ is calculated; then, Softmax is applied to $S$ to obtain the agent attention weight matrix $W_A$, as shown in Equation (9):

$$S = \frac{A K^{\top}}{\sqrt{d}}, \quad W_A = \mathrm{Softmax}(S) \tag{9}$$
Finally, the agent attention weight matrix $W_A$ is used to compute a weighted sum with the value $V$, yielding the aggregated information $V_A = W_A V$. During the information broadcasting process, the similarity matrix between the agent token $A$ and the query $Q$ is calculated as $S' = \frac{Q A^{\top}}{\sqrt{d}}$. A Softmax operation on $S'$ produces the new attention weight matrix $W_Q'$; the Softmax used here is the same as that in Equation (5). Then, taking the aggregated information $V_A$ as the value, a weighted sum with the new attention weights $W_Q'$ gives the agent attention output $O$, as shown in Equation (10):

$$O = W_Q' V_A = \mathrm{Softmax}\!\left(\frac{Q A^{\top}}{\sqrt{d}}\right) W_A V \tag{10}$$
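A minimal PyTorch sketch of the two-stage computation in Equations (9) and (10) follows; obtaining the agent tokens by average-pooling the queries follows the agent attention paper [23], and the number of agent tokens (here 7) is an illustrative assumption.

```python
import math
import torch
import torch.nn.functional as F

def agent_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, n_agents: int = 7):
    """Two-stage agent attention: agents aggregate from K/V (Equation (9)),
    then queries read the aggregate back via the agents (Equation (10))."""
    B, N, d = Q.shape
    # Agent tokens A: average-pooled queries, shape (B, n_agents, d).
    A = F.adaptive_avg_pool1d(Q.transpose(1, 2), n_agents).transpose(1, 2)
    # Aggregation: W_A = Softmax(A K^T / sqrt(d)), V_A = W_A V.
    W_A = torch.softmax(A @ K.transpose(1, 2) / math.sqrt(d), dim=-1)
    V_A = W_A @ V                                   # (B, n_agents, d) aggregated info
    # Broadcasting: O = Softmax(Q A^T / sqrt(d)) V_A.
    W_Q = torch.softmax(Q @ A.transpose(1, 2) / math.sqrt(d), dim=-1)
    return W_Q @ V_A                                # (B, N, d) output

O = agent_attention(torch.randn(2, 49, 64), torch.randn(2, 49, 64), torch.randn(2, 49, 64))
print(O.shape)  # torch.Size([2, 49, 64])
```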
The ability of the agent attention mechanism to focus more on the interactions of a specific agent or subset of features enhances the model’s ability to capture key information in the input sequence and reduces the computational complexity of the attention mechanism.
3.2.5. SmoothLoss Function
In classification tasks, cross-entropy [24] is usually used as a loss function. However, the cross-entropy loss function may lead to the insufficient generalization ability of the model when dealing with complex and diverse data. We propose an improved cross-entropy loss function, SmoothLoss, which combines label smoothing and L2 regularization [25] to improve the accuracy and generalization ability of the model.
The symbols are defined as follows:
$z$: the model output, with a shape of $(N, C)$, where $N$ is the batch size and $C$ is the number of classes.
$y$: the target labels, with a shape of $(N,)$, with each value being an integer between $0$ and $C-1$.
$p$: the model output after Softmax processing, representing the probability of each category.
$\theta$: the model parameters.
Label smoothing employs Equation (11) to smooth the true labels, resulting in the smoothed labels $\tilde{y}$, which reduces the overconfidence of the model and helps improve its generalization ability:

$$\tilde{y}_{i,c} = (1 - \epsilon)\, y_{i,c} + \frac{\epsilon}{C} \tag{11}$$
In the above equation, $y_{i,c}$ is the label vector after one-hot encoding, and $\epsilon$ is the smoothing parameter, set to 0.1.
The cross-entropy loss for the smoothed labels can be represented as Equation (12):

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \tilde{y}_{i,c} \log p_{i,c} \tag{12}$$
In the above equation, $p_{i,c}$ is the probability that the $i$-th sample belongs to the $c$-th category after Softmax, and $\tilde{y}_{i,c}$ represents the smoothed label value of the $i$-th sample for the $c$-th category.
Furthermore, L2 regularization prevents overfitting by penalizing the sum of squares of the parameters of each layer of the model, defined as Equation (14):

$$L_{reg} = \lambda \sum_{j} \theta_j^2 \tag{14}$$
In the above equation, $\lambda$ is the weight of the regularization term.
Combining the above two equations, the total loss function can be represented as Equation (15):

$$L = L_{CE} + L_{reg} \tag{15}$$
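SmoothLoss can be implemented directly from Equations (11)-(15). The PyTorch sketch below is a minimal version; the regularization weight `lam` is an illustrative value, as λ is not stated in the text.

```python
import torch
import torch.nn.functional as F

def smooth_loss(logits, targets, params, epsilon=0.1, lam=1e-4):
    """Label-smoothed cross-entropy (Equations (11)-(12)) plus an L2
    penalty on the model parameters (Equations (14)-(15))."""
    N, C = logits.shape
    log_p = F.log_softmax(logits, dim=-1)                 # log of Softmax output p
    y_onehot = F.one_hot(targets, C).float()              # one-hot labels y
    y_smooth = (1 - epsilon) * y_onehot + epsilon / C     # Equation (11)
    ce = -(y_smooth * log_p).sum(dim=-1).mean()           # Equation (12)
    l2 = sum((p ** 2).sum() for p in params)              # Equation (14), weight lam
    return ce + lam * l2                                  # Equation (15)

# Example with random logits for N = 4 samples and C = 3 classes:
w = torch.randn(3, 5, requires_grad=True)                 # stand-in model parameter
loss = smooth_loss(torch.randn(4, 3), torch.tensor([0, 2, 1, 0]), [w])
print(loss.item())
```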
4. Experiments
4.1. Comparison Experiments
To evaluate the performance of the ET-Mamba classification model, the accuracy (Ac), precision (Pr), recall (Re), and F1 score metrics were assessed using the USTC-TFC2016 [17], ISCX-VPN2016 [18], and ISCX-Tor2016 [19] datasets to compare its performance with that of the mainstream encrypted traffic classification methods such as DeepPacket [26], FS-Net [27], PERT [28], TSCRNN [29], and ET-BERT [12].
A server with an Intel i5-10400F processor, 32 GB of RAM, and an Nvidia GTX 1080 Ti graphics card was used for the experiments. The batch size was 64, Adam was used as the optimizer, and the initial learning rate was set to 0.0001 (Adam's default is 0.001), balancing convergence speed and stability. A batch size of 64 strikes a trade-off between memory efficiency, gradient stability, and generalization ability while suiting hardware parallel computing [30].
The results of the comparison experiments are shown in Table 4, Table 5 and Table 6. The ET-Mamba model outperforms the other methods in all evaluation metrics on the USTC-TFC dataset, with an accuracy, precision, and recall of 99.84%, 99.85%, and 99.84%, respectively, which are 0.69%, 0.70%, and 0.72% higher than those of ET-BERT and 0.74%, 0.75%, and 0.74% higher than those of PERT. On the ISCX-VPN dataset, ET-Mamba outperforms all models except ET-BERT, with an accuracy, precision, and recall of 98.19%, 98.23%, and 98.19%, respectively; its accuracy and precision are 0.71% and 0.68% lower than ET-BERT's, but its recall is 0.39% higher, which is attributable to ET-BERT's pre-training on a large amount of unlabeled VPN data. The ET-Mamba model outperforms the other models in all metrics on the ISCX-Tor dataset, with an accuracy, precision, and recall of 99.65%, 99.66%, and 99.65%, which are 0.44%, 0.43%, and 0.44% higher than those of ET-BERT and 2.59%, 2.56%, and 2.55% higher than those of TSCRNN, respectively.
Table 4.
Comparison of our model with other models on the USTC-TFC dataset.
Table 5.
Comparison of our model with other models on the ISCX-VPN dataset.
Table 6.
Comparison of our model with other models on the ISCX-Tor dataset.
Overall, the ET-Mamba model performs well on multiple datasets and effectively improves the accuracy of encrypted traffic classification.
The processing speed on test data also reflects the model's operational efficiency. We measured the run time on the test sets of the three datasets; the average run times are shown in Table 7. The experiments show that the run time of our model is shorter than that of the other comparison methods.
Table 7.
The average run time of the test data on three datasets.
4.2. Ablation Experiments
The impact of the data resampling, agent attention, and SmoothLoss modules on the performance of the baseline Mamba model is systematically analyzed through ablation experiments. The experiments use the USTC-TFC2016, ISCX-VPN2016, and ISCX-Tor2016 datasets, ablating the modules individually or jointly and evaluating them with the accuracy (Ac), precision (Pr), recall (Re), and F1 score metrics; the experimental results are shown in Table 8, Table 9 and Table 10.
Table 8.
Ablation experiments of our model on the USTC-TFC dataset.
Table 9.
Ablation experiments of our model on the ISCX-VPN dataset.
Table 10.
Ablation experiments of our model on the ISCX-Tor dataset.
As seen in Table 8, Table 9 and Table 10, after replacing the cross-entropy loss function with the SmoothLoss function in the baseline Mamba model, the accuracy improved by 1.15% on the USTC-TFC2016 dataset, by 0.90% on the ISCX-VPN2016 dataset, and by 0.90% on the ISCX-Tor2016 dataset, indicating the effectiveness of the SmoothLoss function in optimizing the loss. The introduction of the agent attention module improved the accuracy by 0.75% on the USTC-TFC2016 dataset, 0.97% on the ISCX-VPN2016 dataset, and 1.77% on the ISCX-Tor2016 dataset, reflecting the effectiveness of the agent attention mechanism in enhancing the model's feature extraction ability. After the introduction of data resampling, the accuracy improved by 0.93% on the USTC-TFC2016 dataset, 0.44% on the ISCX-VPN2016 dataset, and 0.86% on the ISCX-Tor2016 dataset, illustrating the effectiveness of the data balancing process. Introducing any two modules in combination significantly improves model performance over a single module, showing the complementarity and synergy between the modules, and the best performance is achieved when all three modules are combined. The ablation experiments show that the different modules play complementary roles in improving the generalization ability and classification accuracy of the model and can provide significant performance gains in a variety of encrypted traffic classification scenarios.
4.3. Generalization Experiment
To test the generalization ability of the ET-Mamba model, we conducted experiments on the ISCX-VPN2016 and ISCX-Tor2016 datasets: the model weights trained on the ISCX-VPN2016 dataset are used to test the ISCX-Tor2016 test set, and vice versa. The USTC-TFC dataset is not used for the generalization experiments because its label names and number of sample categories differ from those of ISCX-VPN2016 and ISCX-Tor2016.
We selected the Ac and F1 metrics, and the experimental results are shown in Table 11. The weights trained on ISCX-VPN2016 still perform well on the ISCX-Tor2016 dataset, with an Ac 0.06% lower and an F1 0.57% lower than on the ISCX-VPN2016 test set itself. The weights trained on ISCX-Tor2016 also perform well on the ISCX-VPN2016 dataset, with an Ac 0.47% lower and an F1 0.25% lower than on the ISCX-Tor2016 test set itself. These results show that the ET-Mamba model has good generalization ability and can accurately classify datasets it has not seen before.
Table 11.
Generalization experiment on the ISCX-VPN dataset and ISCX-Tor dataset.
5. Conclusions
We propose an encrypted traffic classification model based on the Mamba architecture to improve the performance of encrypted traffic classification. A resampling method combining Wasserstein GAN and random selection is proposed to resolve the dataset imbalance problem. The blocks of the traffic grayscale image are represented as features via positional embedding, and the agent attention mechanism is used in feature extraction to achieve global information modeling at a low computational cost; the random masking approach is used in pre-training so that the model learns the intrinsic correlations between the blocks of the traffic grayscale image. The SmoothLoss function is designed to solve the insufficient generalization ability of the cross-entropy function during training. The model achieves high classification accuracy on the three datasets, USTC-TFC2016, ISCX-VPN2016, and ISCX-Tor2016, with accuracies of 99.84%, 98.19%, and 99.65%, respectively; the number of model parameters is as low as 1.8 M, demonstrating that the proposed model has high classification performance in encrypted traffic classification.
The characteristics of real encrypted traffic keep changing as malware iterates, which makes traditional models that depend on historical data prone to failure. The existing public datasets (ISCX-VPN, ISCX-Tor, USTC-TFC, etc.) are not extensive, making it difficult to identify rare encrypted traffic types. In the future, we plan to use transfer learning to dynamically fine-tune the model to improve its ability to recognize new traffic and to collect more complex traffic types to build a more comprehensive encrypted traffic dataset.
Author Contributions
Conceptualization, J.X., L.C. and W.X.; methodology, L.C., J.X. and L.H.; software, L.C., L.D. and C.W.; validation, J.X., L.C. and L.H.; investigation, J.X., L.C. and W.X.; resources, L.H.; data curation, L.C. and J.X.; writing—original draft preparation, L.C. and J.X.; writing—reviewing and editing, L.H.; supervision, L.H.; project administration, J.X. and W.X.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (grant number: 62362040).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The USTC-TFC2016 dataset can be accessed publicly at https://github.com/davidyslu/USTC-TFC2016 (accessed on 14 April 2025), the ISCX-VPN2016 dataset can be accessed publicly at https://www.unb.ca/cic/datasets/vpn.html (accessed on 14 April 2025), and the ISCX-Tor2016 dataset can be accessed publicly at https://www.unb.ca/cic/datasets/tor.html (accessed on 14 April 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
- Bar-Yanai, R.; Langberg, M.; Peleg, D. Realtime classification for encrypted traffic. In Proceedings of the International Symposium on Experimental Algorithms, Naples, Italy, 20–22 May 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 373–385.
- Rezaei, S.; Liu, X. Deep learning for encrypted traffic classification: An overview. IEEE Commun. Mag. 2019, 57, 76–81.
- Huang, Y.F.; Lin, C.B.; Chung, C.M. Research on qos classification of network encrypted traffic behavior based on machine learning. Electronics 2021, 10, 1376.
- Wang, W.; Zhu, M.; Wang, J.; Zeng, X.; Yang, Z. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Beijing, China, 22–24 July 2017; pp. 43–48.
- Ren, X.; Gu, H.; Wei, W. Tree-RNN: Tree structural recurrent neural network for network traffic classification. Expert Syst. Appl. 2021, 167, 114363.
- Mei, Y.; Luktarhan, N.; Zhao, G.; Yang, X. An Encrypted Traffic Classification Approach Based on Path Signature Features and LSTM. Electronics 2024, 13, 3060.
- Huoh, T.L.; Luo, Y.; Zhang, T. Encrypted network traffic classification using a geometric learning model. In Proceedings of the 2021 IFIP/IEEE International Symposium on Integrated Network Management, Bordeaux, France, 17–21 May 2021; pp. 376–383.
- Zou, Z.; Ge, J.; Zheng, H.; Wu, Y.; Han, C.; Yao, Z. Encrypted traffic classification with a convolutional long short-term memory neural network. In Proceedings of the 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems, Exeter, UK, 28–30 June 2018; pp. 329–334.
- Wang, K.; Gao, J.; Lei, X. MTC: A Multi-Task Model for Encrypted Network Traffic Classification Based on Transformer and 1D-CNN. Intell. Autom. Soft Comput. 2023, 37, 619.
- Lin, X.; Xiong, G.; Gou, G.; Li, Z.; Shi, J.; Yu, J. ET-BERT: A contextualized datagram representation with pre-training transformers for encrypted traffic classification. In Proceedings of the ACM Web Conference, Lyon, France, 25–29 April 2022; pp. 633–642.
- Wang, T.; Xie, X.; Wang, W.; Wang, C.; Zhao, Y.; Cui, Y. NetMamba: Efficient network traffic classification via pre-training unidirectional mamba. In Proceedings of the 2024 IEEE 32nd International Conference on Network Protocols (ICNP), Charleroi, Belgium, 28–31 October 2024; pp. 1–11.
- Almomani, A. Classification of virtual private networks encrypted traffic using ensemble learning algorithms. Egypt. Inform. J. 2022, 23, 57–68.
- Qin, J.; Liu, G.; Duan, K. A New Imbalanced Encrypted Traffic Classification Model Based on CBAM and Re-Weighted Loss Function. Appl. Sci. 2022, 12, 9631.
- Rao, Y.N.; Suresh Babu, K. An imbalanced generative adversarial network-based approach for network intrusion detection in an imbalanced dataset. Sensors 2023, 23, 550.
- Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural network for representation learning. In Proceedings of the 2017 International Conference on Information Networking, Da Nang, Vietnam, 11–13 January 2017; pp. 712–717.
- Draper-Gil, G.; Lashkari, A.H.; Mamun, M.S.I.; Ghorbani, A.A. Characterization of encrypted and vpn traffic using time-related features. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy, Rome, Italy, 19–21 February 2016; pp. 407–414.
- Lashkari, A.H.; Gil, G.D.; Mamun, M.S.I.; Ghorbani, A.A. Characterization of tor traffic using time based features. In Proceedings of the International Conference on Information Systems Security and Privacy, Funchal, Portugal, 19–21 February 2017; pp. 253–262.
- Shapira, T.; Shavitt, Y. FlowPic: Encrypted internet traffic classification is as easy as image recognition. In Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops, Paris, France, 29 April–2 May 2019; pp. 680–687.
- Heinsen, F.A. Softmax Attention with Constant Cost per Token. arXiv 2024, arXiv:2404.05843.
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 5156–5165.
- Han, D.; Ye, T.; Han, Y.; Xia, Z.; Song, S.; Huang, G. Agent attention: On the integration of softmax and linear attention. arXiv 2023, arXiv:2312.08874.
- Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 23803–23828.
- El Moutaouakil, K.; Roudani, M.; Ouhmid, A.; Zhilenkov, A.; Mobayen, S. Decomposition and Symmetric Kernel Deep Neural Network Fuzzy Support Vector Machine. Symmetry 2024, 16, 1585.
- Lotfollahi, M.; Jafari Siavoshani, M.; Shirali Hossein Zade, R.; Saberian, M. Deep packet: A novel approach for encrypted traffic classification using deep learning. Soft Comput. 2020, 24, 1999–2012.
- Liu, C.; He, L.; Xiong, G.; Cao, Z.; Li, Z. FS-Net: A flow sequence network for encrypted traffic classification. In Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1171–1179.
- He, H.Y.; Yang, Z.G.; Chen, X.N. PERT: Payload encoding representation from transformer for encrypted traffic classification. In Proceedings of the 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation, Hanoi, Vietnam, 7–11 December 2020; pp. 1–8.
- Lin, K.; Xu, X.; Gao, H. TSCRNN: A novel classification scheme of encrypted traffic based on flow spatiotemporal features for efficient management of IIoT. Comput. Netw. 2021, 190, 107974.
- Goyal, P.; Dollár, P.; Girshick, R. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv 2017, arXiv:1706.02677.