Article

MS-PreTE: A Multi-Scale Pre-Training Encoder for Mobile Encrypted Traffic Classification

1 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
2 Pengcheng Laboratory, Shenzhen 518000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(8), 216; https://doi.org/10.3390/bdcc9080216
Submission received: 25 June 2025 / Revised: 26 July 2025 / Accepted: 30 July 2025 / Published: 21 August 2025

Abstract

Mobile traffic classification serves as a fundamental component in network security systems. In recent years, pre-training methods have significantly advanced this field. However, as mobile traffic is typically mixed with third-party services, the deep integration of such shared services results in highly similar TCP flow characteristics across different applications. This makes it challenging for existing traffic classification methods to effectively identify mobile traffic. To address the challenge, we propose MS-PreTE, a two-phase pre-training framework for mobile traffic classification. MS-PreTE introduces a novel multi-level representation model to preserve traffic information from diverse perspectives and hierarchical levels. Furthermore, MS-PreTE incorporates a focal-attention mechanism to enhance the model’s capability in discerning subtle differences among similar traffic flows. Evaluations demonstrate that MS-PreTE achieves state-of-the-art performance on three mobile application datasets, boosting the F1 score for Cross-platform (iOS) to 99.34% (up by 2.1%), Cross-platform (Android) to 98.61% (up by 1.6%), and NUDT-Mobile-Traffic to 87.70% (up by 2.47%). Moreover, MS-PreTE exhibits strong generalization capabilities across four real-world traffic datasets.

1. Introduction

With the rapid advancement of mobile networks, particularly the widespread adoption of 5G technology, mobile communication has profoundly transformed society. This progress has also escalated security threats to mobile terminals, including malware, eavesdropping, and data theft, which critically endanger user privacy and financial security. Consequently, mobile network traffic classification has been a central research focus in network management and security detection for decades. Accurate traffic classification enables network operators to rapidly respond to dynamic network conditions and efficiently fulfill diverse service requirements, ultimately enhancing quality of service (QoS) and user experience (QoE) [1,2,3].
The widespread adoption of end-to-end encryption protocols (e.g., HTTPS, TLS) in mobile services has significantly undermined the effectiveness of traditional traffic classification methods [4,5]. In this context, deep learning has emerged as a highly promising technical approach for encrypted traffic classification due to its exceptional automatic feature extraction and end-to-end modeling capabilities. Related studies [6,7,8,9,10,11,12,13,14] have shown that supervised deep learning models can leverage large-scale labeled data to automatically learn discriminative features directly from raw traffic packets, thereby achieving high-precision classification of encrypted traffic. However, deep learning methods heavily rely on large-scale manually annotated datasets, which significantly increases training costs.
Several studies [15,16,17,18] attempt to use pre-trained models to circumvent the need for labeled datasets, but exhibit limited effectiveness for encrypted mobile traffic classification. Although PERT [19] and ET-BERT [20] have achieved certain performance in related datasets, their vocabulary-dependent semantic learning methodologies face difficulties when directly processing raw mobile traffic flows, mainly due to the absence of well-defined semantic units, which therefore yields sub-par classification results. YaTC [21] incorporates the MAE pre-training paradigm that enhances traffic representation learning via patch masking and reconstruction mechanisms, yet remains deficient in specialized architectural designs tailored for mobile network traffic’s distinctive attributes. Furthermore, Chen et al. [22] demonstrate that the MAE framework inadequately harnesses the encoder’s representational capabilities. Thus, prevailing pre-training techniques continue to manifest deficiencies in mobile encrypted traffic detection scenarios.
As observed, these pre-trained models [15,16,17] encounter a significant challenge in distinguishing mobile traffic when applied to the classification of encrypted mobile traffic. Mobile traffic exhibits high similarity to non-mobile traffic, making it difficult to distinguish. As shown in Figure 1, to improve service performance, mobile applications commonly integrate third-party shared services [23]. This architecture results in highly similar TCP flow characteristics across different applications [24], further reducing the effectiveness of flow-based classification methods and presenting a critical challenge for mobile traffic classification.
In this paper, we propose MS-PreTE, a novel pre-training framework for mobile traffic classification. To address the challenge of distinguishing mobile traffic, MS-PreTE transforms raw packets into multi-level, image-like representations, and incorporates a focal-attention mechanism that enables the model to effectively capture subtle distinctions among highly similar flows, which improves the classification effectiveness for encrypted mobile traffic.
Specifically, we propose a two-phase learning paradigm: the first phase leverages self-supervised pre-training on large-scale unlabeled encrypted traffic to extract generalizable representations, and the second phase employs task-specific fine-tuning to achieve optimal classification performance. In the pre-training stage, we propose two tasks that enable a pre-trained network with a ViT [15] architecture to learn general traffic patterns. The first task, Masked Representation Prediction (MRP), captures contextual relationships between traffic bytes. The second task, Masked Traffic Reconstruction (MTR), reconstructs the masked traffic data from the representation of the masked part, further enhancing the model's learning of encrypted traffic patterns. Through this pre-training method, the encoder can focus on learning general representations and obtain higher-quality representations. In the fine-tuning stage, we propose a multi-level representation learning module on top of the pre-trained model to capture spatiotemporal information. To focus on the important packet information within a session, we also introduce a focal-attention mechanism, which adjusts the attention factor to enable the model to distinguish between different applications within highly similar traffic. Finally, we fine-tune the model on the specific classification task to complete the classification goal. Our contributions can be summarized as follows:
  • To more effectively distinguish mobile traffic, we propose a novel multi-level representation mechanism combined with a focal attention mechanism. Specifically, the multi-level representation constructs three distinct information channels to preserve critical traffic features across multiple dimensions. Simultaneously, the focal-attention mechanism introduces an amplitude modulation strategy based on the original attention architecture, which enables the model to focus more precisely on key features during class probability prediction.
  • We propose MS-PreTE, a two-phase learning framework for mobile traffic classification. In the first phase, MS-PreTE employs self-supervised pre-training on large-scale unlabeled encrypted traffic to extract universal representations. The second phase incorporates task-specific fine-tuning with focal-attention mechanisms to capture spatiotemporal patterns, which enables accurate mobile traffic classification.
  • We conduct extensive experimental evaluations of MS-PreTE. Specifically, we compare our method with existing state-of-the-art approaches on three mobile application datasets and four real-world traffic datasets, and perform ablation studies to validate the effectiveness of MS-PreTE. Experimental results demonstrate that MS-PreTE achieves state-of-the-art performance on the three mobile application datasets, boosting the F1 score for Cross-platform (iOS) to 99.34% (up by 2.1%), Cross-platform (Android) to 98.61% (up by 1.6%), and NUDT-Mobile-Traffic to 87.70% (up by 2.47%), and also performs well on other general traffic classification tasks.
The rest of this paper is structured as follows. Section 2 describes the related work. Section 3 introduces our motivation. Section 4 presents the design details of MS-PreTE. Section 5 explains the experimental setup and results. Section 6 conducts a comprehensive discussion of MS-PreTE. Section 7 concludes the paper.

2. Related Works

Current approaches to encrypted traffic classification generally fall into three categories: machine learning-based, deep learning-based, and pre-training-based methods. While pre-training techniques currently achieve state-of-the-art performance in encrypted traffic classification, many machine learning and deep learning solutions remain conceptually innovative and practically relevant. In Section 5, we systematically compare our method with representative works from all three paradigms, evaluating their respective classification performance.

2.1. Machine Learning Based Traffic Classification Methods

Several studies have employed machine learning (ML) techniques for traffic classification using statistical features [25,26,27,28]. AppScanner [29] utilizes packet size statistics from TCP flows to identify mobile application traffic patterns, implementing both Support Vector Machine (SVM) and Random Forest (RF) classifiers. Building on this approach, BIND [30] incorporates both statistical and temporal features of TCP flows within a supervised learning framework. While these ML-based methods demonstrate capability in analyzing complex traffic patterns, they face two fundamental limitations: dependence on manually engineered features requiring domain expertise, and the need for feature space optimization tailored to specific application scenarios [26].

2.2. Deep Learning Based Traffic Classification Methods

The development of deep learning has significantly influenced traffic classification research. Wang et al. [9] proposed mapping fixed-length byte sequences of traffic data into single-channel grayscale images, demonstrating that such representations exhibit distinct inter-class variations while maintaining intra-class similarity. Subsequent studies extended this direction: DF [7] employed Convolutional Neural Networks (CNNs) for direct traffic analysis, while FS-Net [6] introduced an autoencoder-based framework. To better capture temporal dynamics, TSCRNN [8] developed a hybrid architecture combining CNNs and Recurrent Neural Networks (RNNs). Although these deep learning methods automatically extract features from raw traffic bytes, they universally require substantial amounts of labeled training data for supervised learning.

2.3. Pre-Training Based Traffic Classification Methods

In recent years, pre-trained models have demonstrated remarkable success in fields such as computer vision and natural language processing [15,16,17]. In traffic classification, PERT [19] explored the use of ALBERT [17] for classification, while ET-BERT [20] introduced tailored traffic representations and pre-training tasks, achieving strong performance. These works validate the efficacy of pre-training architectures for encrypted traffic classification. However, directly applying vocabulary-based semantic learning to raw traffic bytes is suboptimal due to the absence of well-defined semantic units. In contrast, YaTC [21] adopted the MAE pre-training paradigm, leveraging patch masking and reconstruction for more effective representation learning. Nevertheless, YaTC lacks specialized designs for mobile network traffic, including task-specific optimizations and packet-level interaction modeling. Additionally, Chen et al. [22] highlighted that the MAE architecture underutilizes encoder representation potential. Thus, novel input representations and pre-training strategies remain essential for advancing encrypted traffic analysis.

3. Observation and Motivation

Encrypted traffic classification aims to effectively identify and categorize network traffic protected by various encryption technologies. As illustrated in Figure 2, a network packet primarily consists of two components: the header and the payload. The header contains control information (e.g., IP addresses, port numbers, and protocol types) for routing and transmission, which remains in plaintext, while the payload carries the actual transmitted data (e.g., files, web content, or other media), typically in encrypted form.
Traditional traffic classification methods predominantly rely on visible header fields to construct packet fingerprints for detection. However, with the continued advancement of privacy-preserving technologies and the widespread integration of third-party shared services in mobile applications [23], encrypted mobile traffic classification now faces the challenge of accurately identifying encrypted traffic from mobile endpoints.
As observed, treating encrypted traffic payloads as image pixels presents a viable approach [31,32]. Although encrypted payloads appear as ciphertext that seems random and disordered at surface level, fundamental differences exist in the statistical characteristics and randomness patterns across ciphertexts generated by different applications. These subtle variations reflect distinct encryption algorithms and their implementations, thereby enabling potential traffic classification.
Sengupta et al. [33] successfully distinguished between applications by analyzing inter-ciphertext differences, demonstrating that encrypted traffic is not completely random but contains distinguishable latent features. Furthermore, unlike natural language, ciphertext in encrypted traffic exhibits high entropy and lacks structural organization, which is different from linguistically structured data [34].
In this context, computer vision (CV)-based methods demonstrate unique advantages over natural language processing (NLP) techniques. While these pixel-like data units lack explicit semantics, their spatial combinations reveal meaningful patterns and latent structures. Notably, fundamental CV features (e.g., edges, corners, textures) align remarkably well with the inherent statistical properties and implicit structures of encrypted data. Consequently, employing advanced CV-based pre-training models can more effectively learn universal representations of ciphertext. This approach not only captures the distinctive characteristics of encrypted traffic more effectively, but also enhances the accuracy and efficiency of downstream classification tasks.

4. MS-PreTE

As shown in Figure 3, MS-PreTE first preprocesses raw traffic and extracts key traffic features through the multi-level representation mechanism. The framework then sequentially performs pre-training and fine-tuning with a focal-attention mechanism to ensure accurate mobile traffic classification.

4.1. Preprocessing

MS-PreTE performs preprocessing on raw network data through three main stages: flow extraction, burst extraction, and header masking.
Flow extraction: MS-PreTE identifies and extracts individual network flows from raw traffic using the five-tuple. Let network traffic be denoted as $P = \{ p_i \mid i \in [1, n], i \in \mathbb{N}^+ \}$, where each $p_i$ represents an individual packet. A bidirectional flow $Flow$ is then defined as a subset of $P$, expressed as $Flow = \{ p_j \mid p_j \in P, j \in \mathbb{N}^+ \} \subseteq P$. Here, $p_j$ denotes bidirectional packets transmitted between two ports of distinct hosts using a specific protocol. The flow length is defined as $|Flow| = n_s$, where $n_s$ indicates the total packet count within flow $Flow$. As a flow fundamentally constitutes a bidirectional communication process, its complete record should encompass all request and response data units generated during application-specific interactions between communicating parties. This divergence in transmission patterns enables flows to effectively characterize application behaviors, thus distinguishing different network scenarios.
Burst extraction: MS-PreTE partitions each session into multiple bursts to generate a burst sequence for each flow. For mobile application traffic, the interaction characteristics at the packet level are also of great significance. Therefore, the extraction of bursts becomes imperative to derive higher-fidelity representations. For the data flow $Flow = \{ p_j \mid j \in [1, n], j \in \mathbb{N}^+ \}$, each packet $p_j$ is associated with a direction attribute $d_j \in \{\text{up}, \text{down}\}$. If there exists a consecutive subset of packets $B = \{ p_k, p_{k+1}, \ldots, p_{k+m} \} \subseteq Flow$ such that $\forall p_i \in B$, $d_i = d_k$, then $B$ is defined as a burst. Compared to flows, bursts emphasize the fine-grained interaction logic of data units within a session, such as individual request–response processes or transactional operations governed by application-layer protocols.
Header masking: MS-PreTE masks specific fields in the packet headers to prevent the model from fitting to non-essential information. Specifically, MS-PreTE anonymizes the input sequence by randomly replacing the MAC address, IP address, and port fields of each packet at the burst level. This anonymization is critical, as data flows generated by the same application often share identical address or port information, which may act as explicit identifiers. Such identifiers can hinder the model’s ability to learn unbiased and generalizable representations across different flow categories.
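To make these three preprocessing steps concrete, the sketch below groups already-parsed packet records into bidirectional flows by their five-tuple, splits each flow into direction-homogeneous bursts, and randomizes address and port fields. It is a minimal illustration only: the `Packet` record, the helper names, and the masking policy are our assumptions, not the paper's released implementation.

```python
import random
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Packet:
    # Minimal packet record; a real pipeline would parse these fields from a pcap.
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    direction: str          # "up" or "down" relative to the client
    payload: bytes

def flow_key(p: Packet):
    # Bidirectional five-tuple: sort the two endpoints so both directions share one key.
    ends = sorted([(p.src, p.sport), (p.dst, p.dport)])
    return (ends[0], ends[1], p.proto)

def extract_flows(packets):
    # Group packets (assumed in capture order) into bidirectional flows.
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return list(flows.values())

def extract_bursts(flow):
    # A burst is a maximal run of consecutive packets sharing the same direction.
    bursts, current = [], [flow[0]]
    for p in flow[1:]:
        if p.direction == current[-1].direction:
            current.append(p)
        else:
            bursts.append(current)
            current = [p]
    bursts.append(current)
    return bursts

def mask_header(p: Packet) -> Packet:
    # Randomize addresses and ports so the model cannot shortcut on explicit identifiers.
    rand_ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
    return Packet(rand_ip, rand_ip, random.randint(1024, 65535),
                  random.randint(1024, 65535), p.proto, p.direction, p.payload)
```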

4.2. Flow Representation Model

To better distinguish the mobile traffic, we propose a multi-level representation designed to preserve significant flow information for downstream models. Based on the research of Zhao et al. [21], we design an innovative traffic representation model that converts the original network traffic data into image format as input. This model can distinguish between the header information in the packet and the semantic information of the encrypted payload and adjust the ratio of the two to optimize the representation. In addition, traffic bursts and byte position information are also incorporated into the model to enhance the richness of data representation further.
Specifically, we consider that the semantic information of bytes at different positions in a packet is inconsistent; in particular, the header information differs from the encrypted payload, which often contains ciphertext that is difficult to parse. To capture the multi-level latent information in the traffic, we further convert the preprocessed packet into an image representation. For a given data stream $Flow = \{ p_i \}_{i=1}^{n}$, we encode each packet into a fixed-size matrix:
$$M_i = \begin{bmatrix} H_1 \\ H_2 \\ P_1 \\ \vdots \\ P_6 \end{bmatrix} \in \mathbb{Z}^{8 \times 40}$$
where the first two rows, $H_1$ and $H_2$, store network-layer header information (e.g., IP, TCP, UDP), and the remaining six rows ($P_1$–$P_6$) contain the truncated application payload (limited to 240 bytes). Zero-padding is applied if the packet length is insufficient, and excess data is truncated when necessary. Then, we construct a three-dimensional matrix to represent the network traffic, comprising the raw byte value channel, packet position channel, and burst identification channel.
  • The raw byte value channel $C_{\text{byte}}$ stores the raw value of each byte in a packet, aiming to preserve the complete content of the packet and provide fine-grained data information to support various complex analysis tasks:
    $$C_{\text{byte}}(x, y) = M_i(x, y) \in [0, 255]$$
  • The packet position channel $C_{\text{pos}}$ records the specific position of each byte in its corresponding packet, which is essential for understanding the packet structure and the underlying relationship of its internal bytes:
    $$C_{\text{pos}}(x, y) = \frac{40x + y}{320} \times 255$$
  • The burst identification channel $C_{\text{burst}}$ is designed to preserve the dynamic characteristics of traffic flows, such as periodic transmission patterns in voice communications, and captures temporal relationships and burst traffic features within the data stream. The temporal information of packets within burst sequences is encoded as pixel values. For any packet belonging to burst $B_k$, all corresponding burst information positions are labeled with $k$, which defines $C_{\text{burst}}$ as:
    $$C_{\text{burst}}(x, y) = k, \quad \text{if } p_i \in B_k$$
To balance computational efficiency and information retention, we employ a weighted linear fusion strategy to generate the final grayscale image $I$, which is subsequently fed into the pre-training model:
$$I(x, y) = \frac{75\, C_{\text{byte}} + 38\, C_{\text{pos}} + 15\, C_{\text{burst}}}{128} \in [0, 255]$$
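The sketch below mirrors the representation defined above: each packet is packed into an 8×40 byte matrix (two header rows, six payload rows), the three channels are built, and the weighted fusion produces the grayscale image fed to the encoder. The packing helper and the clipping behavior are our assumptions; only the 8×40 layout and the 75/38/15/128 fusion weights come from the equations above.

```python
import numpy as np

ROWS, COLS = 8, 40                     # 2 header rows + 6 payload rows, 40 bytes per row

def packet_to_matrix(header: bytes, payload: bytes) -> np.ndarray:
    """Pack one packet into the fixed 8x40 matrix M_i (zero-padded or truncated)."""
    buf = (header + bytes(80))[:80] + (payload + bytes(240))[:240]
    return np.frombuffer(buf, dtype=np.uint8).reshape(ROWS, COLS).astype(np.float32)

def build_channels(matrix: np.ndarray, burst_id: int):
    c_byte = matrix                                              # raw byte values, [0, 255]
    x, y = np.meshgrid(np.arange(ROWS), np.arange(COLS), indexing="ij")
    c_pos = (40 * x + y) / 320.0 * 255.0                         # normalized byte position
    c_burst = np.full((ROWS, COLS), float(burst_id))             # burst index of the packet
    return c_byte, c_pos, c_burst

def fuse(c_byte, c_pos, c_burst) -> np.ndarray:
    # Weighted linear fusion into a single grayscale image I(x, y).
    img = (75 * c_byte + 38 * c_pos + 15 * c_burst) / 128.0
    return np.clip(img, 0, 255).astype(np.uint8)
```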

4.3. Pre-Training Phase

We introduce a novel pre-training paradigm for learning from encrypted traffic data, aiming to learn unbiased and informative representations through an encoder-regressor-decoder architecture. The pre-trained model consists of three parts: an encoder, a latent semantic regressor, and a decoder. Architecturally, the encoder module employs a 4-layer self-attention-based Transformer, the semantic regressor consists of a stack of 4 cross-attention Transformer layers, and the decoder similarly comprises 4 layers of self-attention-based Transformer.
The pre-training phase involves two distinct tasks: (1) the Masked Representation Prediction (MRP) task, which is similar to the Masked Language Model in natural language processing (NLP), is used to understand the contextual relationship between traffic bytes by predicting the representations of randomly masked parts of the traffic data (e.g., certain packets or features); (2) the Masked Traffic Reconstruction (MTR) task, where the model needs to reconstruct the traffic data from the representation of the masked part to further enhance its understanding of encrypted traffic patterns.
Encoder: The encoder processes the uncovered image patches $x_v$ and maps them to latent representations $h_v$. We construct a traffic encoder $E = \{ \mathrm{Encoder}_i(\cdot) \mid i \in [1, N] \}$ based on ViT [15], consisting of $N$ layer-stacked Transformer encoders.
Latent Semantic Regressor: The semantic regressor predicts the representations of the masked patches $\bar{h}_m$ to accomplish the MRP task. This prediction is based on the representations of the unmasked patches $h_v$, with the positional encodings of both the masked patches $P_m$ and the unmasked patches $P_v$ jointly considered. We employ a Transformer encoder based on the cross-attention mechanism as the semantic regressor. Specifically, the initial query $Q_m$ (referred to as the masked query) functions as a learnable mask token parameter that remains identical for all masked image patches. The keys and values are identical prior to linear projection, being composed of the representations $h_v$ of unmasked patches and the output from the preceding cross-attention layer (for the first layer, this is $Q_m$ itself). When computing cross-attention weights between queries and keys, the model incorporates $P_m$ and $P_v$. Notably, the latent representations $h_v$ remain frozen throughout this process. The formula for the attention mechanism is as follows:
$$\mathrm{Attn}(Q_m, K, V) = \mathrm{softmax}\!\left(\frac{Q_m K^\top + P_m}{\sqrt{d_k}}\right) V,$$
$$\bar{h}_m = \mathrm{Cross\text{-}Attention}(h_v, P_m)$$
where $K = V = h_v$, $d_k$ is the dimension of the keys, and $\bar{h}_m$ is the representation predicted by the model.
Decoder: The decoder reconstructs the masked image patches $\bar{x}_m$ from their latent representations $\bar{h}_m$ to accomplish the MTR task. Similar to the encoder, the decoder consists of multiple self-attention-based Transformer layers followed by a linear prediction layer at the output stage. The decoder exclusively takes the latent representations of masked patches (output from the regressor) and their corresponding positional embeddings as input, while deliberately excluding any direct information from unmasked patches.
The pre-training workflow is illustrated in Figure 3. MS-PreTE first uniformly partitions the input matrix $x$ into non-overlapping patches, then randomly masks a portion of these patches with probability $r$ ($0 < r < 1$), thereby dividing them into unmasked patches $x_v$ and masked patches $x_m$. This masking strategy requires the model to learn the global structural features of encrypted traffic from contextual information in the absence of partial input, rather than relying on local superficial patterns. Furthermore, to enhance the model's capability for global semantic modeling, we introduce a special class token $x_{cls} \in \mathbb{R}^d$ as a learnable parameter, which subsequently serves to aggregate global semantic information from the entire input sequence.
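The following PyTorch-style sketch illustrates the patch partitioning and random masking step described above, assuming a single-channel traffic image and a per-sample random permutation; the patch size and tensor shapes are illustrative rather than the exact configuration used in the paper.

```python
import torch

def patchify(img: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Split a (B, 1, H, W) traffic image into a (B, N, patch*patch) patch sequence."""
    B, C, H, W = img.shape
    p = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    return p.reshape(B, -1, patch * patch * C)

def random_mask(patches: torch.Tensor, ratio: float = 0.9):
    """Randomly mask `ratio` of the patches; return visible patches plus both index sets."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - ratio))
    idx = torch.rand(B, N).argsort(dim=1)            # random permutation per sample
    keep_idx, mask_idx = idx[:, :n_keep], idx[:, n_keep:]
    x_v = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return x_v, keep_idx, mask_idx
```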
The detailed pre-training implementation and model architecture are illustrated in Figure 4. First, the feature encoder processes the unmasked patches $x_v$ to extract their fundamental feature representations $h_v$. Subsequently, the semantic regressor combines these unmasked patch representations $h_v$ with their positional encodings to predict the corresponding masked patch representations $\bar{h}_m$. Finally, the traffic decoder reconstructs the masked patches $\bar{x}_m$ based on these predicted representations.
The pre-training process is optimized through carefully designed loss functions. First, to evaluate the model’s accuracy in reconstructing masked image patches, we introduce pixel-level Mean Squared Error (MSE) as the reconstruction loss function, defined as follows:
$$\mathcal{L}_{rec} = \mathrm{MSE}(x_m, \bar{x}_m)$$
Second, to ensure the latent semantic regressor learns data distributions consistent with the encoder, we introduce an alignment loss for predicting the representations of masked patches. This loss measures the alignment between the predicted representations and the ground-truth representations of the masked patches. Specifically, we input the masked patches $x_m$ into the feature encoder to obtain the ground-truth representations $h_m$, and compute the loss using MSE, defined as follows:
$$\mathcal{L}_{align} = \mathrm{MSE}(h_m, \bar{h}_m)$$
Finally, the overall loss function is composed of the above two losses weighted by a parameter λ , defined as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{align}$$
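Putting the two objectives together, a training step computes the reconstruction loss on the decoded masked patches and the alignment loss between the regressor's predictions and the encoder's own encoding of the masked patches. The sketch below assumes `encoder`, `regressor`, and `decoder` callables with the interfaces described in this section; their exact signatures are ours for illustration.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(encoder, regressor, decoder, x_v, x_m, pos_v, pos_m, lam=1.0):
    """One pre-training step: L_total = L_rec + lambda * L_align (sketch)."""
    h_v = encoder(x_v, pos_v)                          # representations of visible patches
    h_m_pred = regressor(h_v.detach(), pos_v, pos_m)   # MRP: predict masked representations
    x_m_rec = decoder(h_m_pred, pos_m)                 # MTR: reconstruct masked patches

    with torch.no_grad():                              # ground-truth targets for alignment
        h_m_true = encoder(x_m, pos_m)

    loss_rec = F.mse_loss(x_m_rec, x_m)                # pixel-level reconstruction loss
    loss_align = F.mse_loss(h_m_pred, h_m_true)        # representation alignment loss
    return loss_rec + lam * loss_align
```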

4.4. Fine-Tuning Phase

While the pre-training phase equips the model with a fundamental understanding of mobile encrypted traffic, direct application to specific classification tasks still presents challenges, particularly when different applications share common third-party services. For instance, when mobile and non-mobile applications simultaneously rely on the same CDN, their generated TCP flows may exhibit highly similar patterns. To address these challenges effectively, we design a dual-layer encoding architecture that progressively extracts discriminative features at both the packet level and the flow level. During fine-tuning, we retain the pre-trained encoder's structure and weights while replacing the original semantic regressor and decoder with task-specific fully connected layers for classification. Notably, recognizing that different applications may exhibit subtle differences in critical packets or features despite overall traffic pattern similarities, we incorporate a focal-attention mechanism into the flow-level encoder to enhance the model's sensitivity to these fine-grained discriminative characteristics.
Packet-Level Encoder. The packet-level encoder focuses on packet-level information interaction, specifically the temporal relationships between bytes, to extract intra-packet feature patterns. In other words, it ignores patch features from other packets and outputs packet-level representations. Consequently, it employs multi-head self-attention only within patches of the same packet to facilitate localized information exchange. The attention computation for each head is defined as:
$$Q = x_l W_Q, \quad K = x_l W_K, \quad V = x_l W_V,$$
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are the trainable parameters of the model, and $Q$, $K$, and $V$ denote the query, key, and value vectors, respectively.
Flow-Level Encoder. The flow-level encoder addresses the challenge of identifying subtle inter-application differences. For computational efficiency, we avoid applying attention mechanisms directly to all image patches, instead processing the outputs from the packet-level encoder. Specifically, MS-PreTE first performs row-wise average pooling on packet-level features:
$$x_r = \mathrm{AvgPooling}(x_p)$$
where $x_p \in \mathbb{R}^{N \times d}$ denotes the output from the packet-level encoder, and $x_r$ denotes the result of row-wise average pooling.
However, given that most mobile applications share common third-party services (e.g., CDNs), their TCP flows exhibit high similarity. To address this fundamental challenge, we propose an innovative focal-attention mechanism, whose core principle is to adaptively modulate the "sharpness" of attention distributions, enabling the model to precisely identify discriminative features within highly similar traffic patterns. In standard self-attention layers, the computation proceeds as follows before applying the softmax function: first, the dot-product attention score matrix $A$ between queries $Q$ and keys $K$ is calculated as $A = Q K^\top$. This matrix is then scaled by $\frac{1}{\sqrt{D}}$ to normalize the values, thereby enhancing training stability. The complete attention computation process is defined below:
$$\mathrm{SelfAttn}(F) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{D}}\right) V,$$
As demonstrated by Wei et al. [35], this scaling operation aims to prevent the generation of excessively small weight values $W$. However, in the context of mobile traffic classification, we require more targeted attention distributions. The attention vector $A$ can be decomposed into two components, its magnitude $\|A\|$ and its direction $\hat{A}$:
$$A = \|A\| \cdot \hat{A},$$
Assuming that the dot product computed for an irrelevant flow is relatively small, the weight $w_i$ for the irrelevant flow can be calculated according to the softmax formula:
$$w_i = \frac{e^{\|A\| \hat{a}_i}}{\sum_{j=1}^{K} e^{\|A\| \hat{a}_j}}.$$
where $\hat{a}_i$ is a vector in matrix $A$. Therefore, we achieve the focal-attention mechanism by increasing the magnitude of the dot product $A$:
$$w_i = \frac{e^{\frac{\|A\|}{\gamma} \hat{a}_i}}{\sum_{j=1}^{K} e^{\frac{\|A\|}{\gamma} \hat{a}_j}},$$
$$\mathrm{Focal\text{-}SelfAttn}(F) = \mathrm{softmax}\!\left(\frac{N(Q)\, N(K)^\top}{\gamma}\right) V,$$
where $N(\cdot)$ is the normalization function, and $\gamma$ is the focus hyperparameter. By adjusting the value of $\gamma$, we can concentrate the attention distribution more on discriminative features, enabling the softmax function to assign minimal weights to irrelevant flow characteristics. Finally, classification optimization is achieved through a fully connected layer and the cross-entropy loss:
$$\mathcal{L}_{CrossEntropy} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(\hat{Y}), Y\big) = -\sum_{i=1}^{c} y_i \log \frac{\exp(\hat{y}_i)}{\sum_{j=1}^{c} \exp(\hat{y}_j)}$$
where $Y = \{ y_1, y_2, \ldots, y_c \}$ represents the true label probabilities, while $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_c \}$ denotes the predicted probability vector. The symbol $c$ indicates the number of classes in the fine-tuned dataset.
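As a concrete reading of the focal-attention formula above, the sketch below L2-normalizes queries and keys and divides the resulting similarity matrix by the focus hyperparameter γ, so that a small γ sharpens the attention distribution. The single-head module, the projection layers, and the default γ value are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalSelfAttention(nn.Module):
    """Single-head focal self-attention: normalized Q/K with focus hyperparameter gamma."""
    def __init__(self, dim: int, gamma: float = 0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gamma = gamma                    # smaller gamma -> sharper attention weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, dim)
        q = F.normalize(self.q(x), dim=-1)    # N(Q): unit-norm queries
        k = F.normalize(self.k(x), dim=-1)    # N(K): unit-norm keys
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.gamma, dim=-1)
        return attn @ self.v(x)
```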

5. Experiments

This section provides a comprehensive evaluation of the MS-PreTE model's performance in encrypted traffic classification tasks through multi-aspect evaluation. Initially, we perform classification experiments across multiple specific encrypted traffic classification tasks (Section 5.2) to illustrate the effectiveness of MS-PreTE in various encryption scenarios. Subsequently, we compare MS-PreTE against existing methods for encrypted traffic classification (Section 5.3) to elucidate its relative advantages. Following this, we conduct ablation studies to assess the contribution of each component to the overall performance of the model (Section 5.4). Additionally, latent feature visualization through t-SNE further substantiates the model's decision boundaries (Section 5.5). Acknowledging that pre-trained models may face limitations due to unknown traffic distributions, we undertake an assessment of transfer learning capabilities (Section 5.6). Finally, we analyze the model's computational resource requirements (Section 5.7).

5.1. Experiment Setup

5.1.1. Datasets

To comprehensively evaluate the effectiveness and generalizability of MS-PreTE on encrypted traffic classification, we set up five different traffic classification tasks covering seven public datasets. Table 1 shows the statistics of these tasks and their corresponding datasets. We do not perform additional data augmentation operations on the original datasets. Although class imbalance does exist in some datasets, we maintain that this largely reflects the inherent characteristics of real-world network traffic distributions. To preserve the authenticity and practical relevance of our experimental setup, we deliberately model and evaluate the raw data without artificial augmentation, thereby validating the model's robustness and generalization capability in realistic imbalanced scenarios.
Encrypted Traffic Classification on the General Application (ETCGA) Task
This task aims to classify application traffic under standard encryption protocols. We tested on Cross-Platform (iOS) [25] and Cross-Platform (Android) [25], with 196 and 215 apps, respectively. The iOS and Android apps were collected from the top 100 apps in the US, China, and India. These datasets have a large number of categories and a long-tail data distribution over all classes. We also tested on NUDT-Mobile-Traffic [36]. This dataset contains traffic data for 350 mobile apps, the largest number of categories among publicly available mobile app datasets, which makes it more challenging than the other datasets. Each sample typically includes statistical features such as flow duration, average packet length, forward/backward packet counts, and standard deviation of packet lengths, with labels corresponding to specific application names such as Instagram or YouTube.
Encrypted Traffic Classification on the Malware Application (ETCMA) Task
This task aims to classify encrypted traffic consisting of malware and benign apps. We used a dataset commonly used in current research, namely USTC-TFC-2016 [37], which contains encrypted traffic from 10 malware and 10 benign applications. This widely recognized dataset was published by a researcher from the University of Science and Technology of China (USTC) and has been extensively adopted by researchers. Each traffic sample contains key metrics such as maximum/minimum inter-packet intervals, packet length variance, etc., with labels indicating the corresponding encryption protocol or service type.
Encrypted Traffic Classification on VPN (ETCV) Task
This task classifies encrypted traffic from applications that use virtual private networks (VPNs) to communicate. We used ISCX-VPN-2016 [38], a dataset commonly used in current research and publicly available from the University of New Brunswick. Network packets captured at the data link layer were classified according to the application that generated them (e.g., Chat, FTP, and VoIP) and divided into different pcap files, resulting in the ISCX-VPN-2016 dataset containing seven categories. Each traffic sample contains key metrics such as maximum/minimum inter-packet intervals, packet length variance, etc., with labels indicating the corresponding encryption protocol or service type.
Encrypted Traffic Classification on Tor (ETCT) Task
This task aims to classify encrypted traffic that uses the Onion Router (Tor) for communication privacy enhancement. The dataset is called ISCX-Tor-2016 [39], which contains sixteen applications, grouped into eight categories. This kind of traffic further obscures the behavior by obfuscating the communication between the sender and the receiver through a distributed routing network, which is more challenging for traffic classification as the pattern extraction of traffic becomes harder. Each traffic sample contains key metrics such as maximum/minimum inter-packet intervals, packet length variance, etc., with labels indicating the corresponding encryption protocol or service type.
Traffic Classification on the Malware Application (TCIoT) Task
This task aims to classify massive attack traffic in IoT environments. It uses the CIC-IoT-2023 [40] dataset, which is the latest IoT traffic dataset collected by the Canadian Institute for Cybersecurity (CIC). It comprises benign background traffic and seven distinct malicious scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and Infiltration, collected over a 10-day period. The network infrastructure includes 50 malicious machines in a local network, as well as 420 victim machines and 30 victim servers distributed across five departments. The dataset includes 7457 packets and 3514 flows. There are 33 categories in this dataset. Each sample record contains fields such as timestamp, source/destination ports, average packet length, etc., with the attack type as the classification label.

5.1.2. Implementation Details

In this study, we first pre-trained the model on all seven combined datasets to fully explore universal traffic representation capabilities. Subsequently, we fine-tuned the model separately on each individual dataset to adapt to their specific traffic feature distributions. During fine-tuning, all datasets were split into training and test sets at an 8:1 ratio.
The detailed hyperparameter settings are as follows. In the pre-training phase, the batch size is 512, and the total number of training steps is 150,000. We used a linear learning-rate scaling rule and set the base learning rate to $1 \times 10^{-3}$ with the AdamW optimizer. The masking ratio for randomly masked patches was set to 0.9. For the fine-tuning phase, we use the AdamW optimizer for 200 epochs, with the base learning rate set to $2 \times 10^{-3}$ and a batch size of 64. The proposed method is implemented using PyTorch 2.1.1 and trained on a server with two NVIDIA GeForce RTX 4090 GPUs.
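For reference, the snippet below shows how the stated fine-tuning hyperparameters (AdamW, base learning rate 2e-3, batch size 64, 200 epochs) might be wired up. The linear-scaling reference batch size, weight decay, and cosine schedule are our assumptions and are not specified in the paper.

```python
import torch

def build_finetune_optimizer(model, steps_per_epoch: int,
                             base_lr: float = 2e-3, batch_size: int = 64, epochs: int = 200):
    """AdamW plus a cosine schedule mirroring the stated fine-tuning settings (sketch)."""
    lr = base_lr * batch_size / 256          # linear LR scaling rule (reference batch assumed)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```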

5.1.3. Evaluation Metrics

To measure the classification performance of our method, we calculate the number of True Positive ( T P ), True Negative ( T N ), False Positive ( F P ), and False Negative ( F N ). We also use Accuracy (AC), Precision (PR), Recall (RC), and F1 Score (F1) as evaluation metrics. The reason for choosing them is that these metrics are very common in different machine learning tasks [19,41,42,43,44]. Since encrypted traffic classification is a multicategory task, we need to calculate the above metrics for each category separately.
Specifically, we use $N$ as the total number of training samples; $TP_c$ indicates the quantity that originally belongs to category $c$ and is predicted by the model as $c$; $FP_c$ indicates the quantity that does not belong to category $c$ but is predicted by the model as $c$; $TN_c$ indicates the quantity that does not belong to category $c$ and is not predicted to be class $c$; and $FN_c$ indicates the quantity that belongs to class $c$ but is misclassified into another class. Hence, the definitions of the aforementioned four evaluation metrics are given as follows:
$$AC_c = \frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c}, \quad PR_c = \frac{TP_c}{TP_c + FP_c}, \quad RC_c = \frac{TP_c}{TP_c + FN_c}, \quad F1_c = \frac{2 \times PR_c \times RC_c}{PR_c + RC_c}$$
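These per-class metrics follow directly from the counts defined above; a straightforward NumPy computation is sketched below (macro-averaging over classes, if desired, is then a simple mean of the per-class values).

```python
import numpy as np

def per_class_metrics(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Per-class accuracy, precision, recall, and F1 from the TP/TN/FP/FN counts."""
    results = {}
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        tn = np.sum((y_pred != c) & (y_true != c))
        ac = (tp + tn) / (tp + tn + fp + fn)
        pr = tp / (tp + fp) if (tp + fp) else 0.0
        rc = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * pr * rc / (pr + rc) if (pr + rc) else 0.0
        results[c] = {"AC": ac, "PR": pr, "RC": rc, "F1": f1}
    return results
```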

5.2. Classification Performance

As shown in Figure 5, MS-PreTE exhibits consistent training stability and convergence across all datasets. Notably, the model attains high accuracy within 40 epochs, demonstrating rapid convergence—an advantageous trait for large-scale mobile app classification tasks in real-world applications. Analysis of the loss curves reveals that the model enters a stable phase after convergence, with minimal performance improvements from further training. This indicates that the model successfully captures the features of the data rather than merely memorizing training samples. Based on this observation, we set the training epochs to 200, achieving an optimal balance between computational efficiency and model performance.
Furthermore, to further evaluate the classification performance of MS-PreTE, we conducted a visual analysis of confusion matrices on the test datasets, as illustrated in Figure 6. The NUDT-Mobile-Traffic dataset presents a demanding scenario, containing 350 application categories, far exceeding the scale of the other datasets. In this large-scale classification task, the model still maintains an accuracy rate above 90%, fully demonstrating its capability to handle complex classification challenges. This result strongly demonstrates the unique advantages of our approach in processing highly obfuscated traffic. Due to space limitations, only the results of a subset of categories are displayed in the figure. Nevertheless, the results for the remaining categories remain consistent with these findings.

5.3. Comparison with State-of-the-Art Methods

We compare MS-PreTE with a set of representative state-of-the-art (SOTA) methods for encrypted traffic classification, including (1) statistical feature methods: AppScanner [29], CUMUL [44], BIND [30]; (2) deep learning methods: FS-Net [6], AppNet [45], DF [7], Beauty [46], SAM [41]; (3) graph neural network methods: GraphDApp [42], TFE-GNN [43], FB-GNN [47]; and (4) pre-training methods: PERT [19], ET-BERT [20], YaTC [21]. We conducted experiments on all baseline methods in the mobile traffic dataset, and the experimental results are shown in Table 2 and Table 3.
In mobile application classification tasks (ETCGA), MS-PreTE demonstrates remarkable advantages. As shown in Table 2, the model achieves an F1-score of 99.34% on the Cross-Platform (iOS) dataset, with 2.1% improvement over the suboptimal method. Similarly, it attains a 98.61% F1-score on the Cross-Platform (Android) dataset, with a 1.6% improvement. These improvements are particularly noteworthy given the already high baseline performance. On the more challenging NUDT-Mobile-Traffic dataset with 350 application categories, MS-PreTE maintains a robust F1-score of 87.70%, significantly outperforming other approaches. These results conclusively validate the effectiveness of our proposed multi-level representation and focal attention mechanism in handling large-scale classification tasks. The NLP-based pre-training approaches (e.g., ET-BERT) demonstrate suboptimal performance on the NUDT-Mobile-Traffic dataset, achieving only 79.4% F1-score. This observation validates our analysis: the byte distribution in encrypted traffic resembles image pixels rather than natural language, making computer vision paradigms more suitable for processing such data. While YaTC, which also adopts a vision-inspired approach, shows relatively better performance, its simplistic masked reconstruction strategy fails to capture complex traffic patterns effectively, resulting in inferior performance compared to our method.
MS-PreTE demonstrates strong adaptability across the remaining tasks. As shown in Table 3, the model achieves a 97.55% F1-score for malware detection in ETCMA tasks (USTC-TFC-2016), outperforming all baseline methods, and attains 96.80% classification accuracy for VPN traffic in ETCV tasks, representing a 1.68% improvement over the state-of-the-art approach. These results confirm that our method is not only effective for mobile application classification but also excels in other encrypted traffic analysis scenarios. However, MS-PreTE shows marginally lower performance than YaTC (by 0.33%) on TCIoT tasks, likely due to the relatively simple and highly repetitive nature of IoT traffic patterns where our focal attention mechanism offers less pronounced advantages.
These experimental results demonstrate the effectiveness of MS-PreTE across a variety of encrypted traffic classification tasks.

5.4. Ablation Study

To gain a deeper understanding of the contributions of each key component in MS-PreTE to model performance, we conducted ablation experiments on multiple real-world traffic datasets. The ablation studies included: (1) Replacing the multi-level encoder structure with a single-layer encoder (Single Encoder); (2) Removing the focal attention mechanism and using standard attention instead (w/o Focal-Attention); (3) Eliminating the traffic representation module and relying solely on raw byte sequences (w/o FRM); (4) Removing the pre-training phase (w/o Pre-train). The results of the ablation study are presented in Table 4 and Table 5.
First, simplifying the multi-level encoder structure to a single-layer encoder significantly degrades model performance. For instance, on the Cross-Platform (iOS) dataset, the F1-score drops from 99.34% (full model) to 97.82%, with similar declines observed on other datasets such as NUDT-Mobile-Traffic and ISCX-VPN-2016. This demonstrates the irreplaceable importance of multi-scale feature extraction for capturing both packet-level and flow-level complex patterns, as a single-layer encoder fails to adequately learn high-level traffic structural information.
Second, replacing the focal-attention mechanism with the standard self-attention mechanism (w/o Focal-Attention) leads to varying degrees of F1-score degradation across all datasets. For example, on Cross-Platform (iOS), the F1-score drops to 98.29%. This indicates that the focal mechanism effectively enhances the model’s ability to identify critical features, particularly when distinguishing highly similar traffic (e.g., different applications sharing CDN services). Without it, the model’s discriminative accuracy is notably reduced.
Furthermore, eliminating the traffic representation module (w/o FRM) forces the model to rely solely on raw byte sequences for classification, resulting in a further F1-score decline to 98.92%. This performance loss highlights the importance of structured traffic representations (e.g., positional encoding and burst detection) in providing essential contextual and temporal features, thereby improving classification comprehensiveness and robustness.
Finally, removing the pre-training phase (w/o Pre-train) has the most severe impact, reducing the F1-score on Cross-Platform (iOS) to 95.10%, with similarly significant drops across other datasets. This strongly validates that self-supervised pre-training substantially enhances the model’s feature extraction and generalization capabilities, particularly in data-scarce scenarios, effectively mitigating overfitting risks.
In summary, each core component of MS-PreTE plays a critical role in traffic classification performance: the multi-scale encoder and focal-attention mechanism ensure deep semantic modeling, while the traffic representation module and pre-training phase further improve generalization and robustness, enabling MS-PreTE to achieve strong performance across diverse traffic classification tasks.

5.5. t-SNE Visualization on Latent Feature Representations

To further validate the effectiveness of MS-PreTE in learning latent characteristics for unknown scenarios, Figure 7 presents visualizations of the learning process of latent characteristic representations on the ISCX-Tor-2016 dataset. We employed both t-SNE and PCA for dimensionality reduction of high-dimensional features. These techniques enable clearer visualization of relationships between data points.
Experimental results on the ISCX-Tor-2016 dataset demonstrate that as training progresses, t-SNE progressively identifies distinct cluster boundaries corresponding to various network activities including chat, browsing, and email. These well-formed clusters indicate that MS-PreTE successfully captures distinctive latent features associated with each activity type. The tight intra-cluster distribution of data points further verifies the model’s capability to differentiate network behaviors based on inherent patterns. The clearly defined clusters strongly suggest MS-PreTE’s potential for accurate network flow classification in previously unseen scenarios. Moreover, the comparative analysis between t-SNE and PCA reveals additional structural characteristics of the data. While PCA primarily extracts the most significant linear components, t-SNE excels at preserving local similarities and uncovering latent nonlinear relationships embedded within high-dimensional data.

5.6. Transfer Learning Analysis

In the transfer learning experiment, we choose the ISCX-Tor-2016 dataset. Because anonymous networks communicate through three layers of proxies, their TCP flows are also highly similar, a situation analogous to mobile application traffic.
As shown in Figure 8, we compare transfer learning on the Tor dataset by introducing three pre-trained traffic classification methods (PERT, ET-BERT, and YaTC). The pre-trained models used are identical to those used in previous experiments (i.e., pre-trained using the other six datasets). MS-PreTE significantly increases the F1 score from 84.62% to 99.29%, an improvement of 14.67% over the scenario without pre-training. The effect of YaTC is also quite pronounced, with an improvement of 14.2% over the non-pre-trained scenario. Furthermore, both ET-BERT and PERT show weak improvements with pre-training, indicating that their pre-trained models struggle to transfer to new downstream traffic classification tasks. The results show that MS-PreTE successfully captures generalizable representations of mobile application network traffic during pre-training, demonstrating strong transferability to other similar datasets.

5.7. Resource Consumption Analysis

To evaluate the resource consumption of MS-PreTE, we conducted a comparative analysis from two perspectives: FLOPs (floating-point operations per forward pass) and parameter count. As shown in Table 6, MS-PreTE has approximately 910 million FLOPs and 3.7 million parameters, placing its overall model complexity at a moderate level. Compared to ET-BERT, MS-PreTE significantly reduces computational cost—requiring only about 7.6% of ET-BERT’s FLOPs—while its parameter count is also substantially lower than ET-BERT’s 100 million. This reduction in computational overhead implies that MS-PreTE consumes fewer hardware resources during both training and inference, making it more suitable for deployment in resource-constrained environments. In addition, the modular design of MS-PreTE contributes to better control over model scale, thereby improving training efficiency and reducing the time and computational cost associated with the pre-training phase.

6. Discussion

  • Model Lightweighting
In our resource consumption analysis (Section 5.7), we evaluated the computational overhead of MS-PreTE. The results demonstrate comparable resource consumption to state-of-the-art deep learning methods, confirming its practical deployability on edge servers. However, for extreme resource-constrained scenarios (e.g., embedded devices, low-power terminals) with stringent computing and memory limitations, there remains room for further compression and optimization. Future work could incorporate model compression techniques such as pruning, knowledge distillation, and low-rank decomposition to enhance the model’s adaptability and real-time performance in mobile edge environments.
  • Interpretability
While the current model demonstrates excellent classification performance, its interpretability remains limited, making it difficult to explain decision-making rationale to external users or security auditors. In practical cybersecurity applications, model predictions often require certain levels of transparency and traceability to enhance trustworthiness and controllability in critical business systems. Our future research will incorporate Explainable AI (XAI) methodologies to analyze the model’s response patterns to traffic features, thereby improving interpretability and operational security assurance.
  • Adaptation Capability for Unknown Traffic Patterns
Although MS-PreTE employs a pre-training mechanism that theoretically provides certain generalization capabilities for handling unseen samples, its recognition accuracy may still degrade when encountering completely new encryption protocols or previously unknown mobile traffic patterns. To enhance the model’s adaptability to unknown patterns, our future work will explore incorporating incremental learning, transfer learning, or meta-learning approaches. These methods would enable the model to dynamically adapt to emerging traffic patterns without requiring complete retraining, thereby improving its stability and scalability in complex real-world network environments.

7. Conclusions

In this paper, we introduce MS-PreTE, a Multi-Scale encrypted traffic pre-training encoder that can learn latent traffic representations from large-scale unlabeled data and then accurately classify encrypted traffic in multiple scenarios with simple fine-tuning. The core of MS-PreTE includes the traffic representation, the multi-level encoder, and the pre-training tasks. It adopts a multi-level, multi-channel traffic representation; compared with ET-BERT and YaTC, this adjustment allows the input to retain the information of the original traffic to the greatest extent for subsequent models. In addition, MS-PreTE proposes the MRP and MTR pre-training tasks, which enable the encoder to fully exploit its learning capacity and learn unbiased, higher-quality representations, ultimately benefiting the subsequent fine-tuning process for mobile application classification tasks. The focal-attention mechanism allows the model to distinguish similar traffic by focusing on the key traffic information in the flow. We comprehensively evaluate the generalization and robustness of MS-PreTE on three mobile application datasets and four public datasets, and the experimental results confirm its effectiveness. In the future, we plan to further explore the ability and potential of MS-PreTE in processing and predicting new categories of samples.

Author Contributions

Conceptualization, Z.W. and Y.Q.; methodology, Z.W. and Y.Q.; software, Y.Q.; validation, Y.Q.; formal analysis, Y.Q.; investigation, Z.W. and Y.Q.; resources, Z.W. and Y.Q.; data curation, Z.W. and Y.Q.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W.; visualization, Z.W., X.L. and Y.Q.; supervision, Y.L. and S.Z.; project administration, Y.L. and S.Z.; funding acquisition, Y.L. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (62372124).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their insightful comments and constructive suggestions that have obviously improved the quality of this manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal circumstances that could have appeared to influence the work reported in this manuscript.

References

  1. Papadogiannaki, E.; Ioannidis, S. A Survey on Encrypted Network Traffic Analysis Applications, Techniques, and Countermeasures. ACM Comput. Surv. 2021, 54, 1–35. [Google Scholar] [CrossRef]
  2. Chen, Y.; You, W.; Lee, Y.; Chen, K.; Wang, X. Mass Discovery of Android Traffic Imprints through Instantiated Partial Execution. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 815–828. [Google Scholar]
  3. Dai, S.; Tongaonkar, A.; Wang, X.; Nucci, A.; Song, D. NetworkProfiler: Towards Automatic Fingerprinting of Android Apps. In Proceedings of the 2013 Proceedings IEEE INFOCOM, Turin, Italy, 14–19 April 2013; pp. 809–817. [Google Scholar]
  4. Chen, Z.; Yu, B.; Zhang, Y.; Zhang, J.; Xu, J. Automatic Mobile Application Traffic Identification by Convolutional Neural Networks. In Proceedings of the 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, 23–26 August 2016; pp. 301–307. [Google Scholar]
  5. Choi, Y.; Chung, J.Y.; Park, B.; Hong, J. Automated Classifier Generation for Application-Level Mobile Traffic Identification. In Proceedings of the 2012 IEEE Network Operations and Management Symposium, Maui, HI, USA, 16–20 April 2012; pp. 1075–1081. [Google Scholar]
  6. Liu, C.; He, L.; Xiong, G.; Cao, Z.; Li, Z. FS-Net: A Flow Sequence Network for Encrypted Traffic Classification. In Proceedings of the IEEE INFOCOM 2019—IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1171–1179. [Google Scholar]
  7. Sirinam, P.; Imani, M.; Juarez, M.; Wright, M. Deep Fingerprinting: Undermining Website Fingerprinting Defenses with Deep Learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, Canada, 15–19 October 2018; pp. 1928–1943. [Google Scholar]
  8. Lin, K.; Xu, X.; Gao, H. TSCRNN: A Novel Classification Scheme of Encrypted Traffic Based on Flow Spatiotemporal Features for Efficient Management of IIoT. Comput. Netw. 2021, 190, 107974. [Google Scholar] [CrossRef]
  9. Wang, W.; Zhu, M.; Wang, J.; Zeng, X.; Yang, Z. End-to-End Encrypted Traffic Classification with One-Dimensional Convolution Neural Networks. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 43–48. [Google Scholar]
  10. Maniwa, R.; Ichijo, N.; Nakahara, Y.; Matsushima, T. Boosting-Based Sequential Meta-Tree Ensemble Construction for Improved Decision Trees. arXiv 2024, arXiv:2402.06386. [Google Scholar]
  11. Pu, F.; Wang, Y.; Ye, T.; Jin, Y.; Liu, Q.; Zou, Q. Detecting Unknown Encrypted Malicious Traffic in Real Time via Flow Interaction Graph Analysis. In Proceedings of the 2023 Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 27 February–3 March 2023. [Google Scholar]
  12. Lotfollahi, M.; Siavoshani, M.J.; Hossein Zade, R.S.; Saberian, M. Deep Packet: A Novel Approach for Encrypted Traffic Classification Using Deep Learning. Soft Comput. 2020, 24, 1999–2012. [Google Scholar] [CrossRef]
  13. Rezaei, S.; Liu, X. Deep Learning for Encrypted Traffic Classification: An Overview. IEEE Commun. Mag. 2019, 57, 76–81. [Google Scholar] [CrossRef]
  14. Wang, Q.; Qian, C.; Li, X.; Zhao, W.; Xu, K. Lens: A Foundation Model for Network Traffic in Cybersecurity. arXiv 2024, arXiv:2404.12345. [Google Scholar]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  17. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  18. Liu, T.; Ma, X.; Liu, L.; Xu, W.; Zhu, Z.; Wang, Z. LAMBERT: Leveraging Attention Mechanisms to Improve the BERT Fine-Tuning Model for Encrypted Traffic Classification. Mathematics 2024, 12, 1624. [Google Scholar] [CrossRef]
  19. He, H.Y.; Yang, Z.G.; Chen, X.N. PERT: Payload Encoding Representation from Transformer for Encrypted Traffic Classification. In Proceedings of the 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K), Ha Noi, Vietnam, 7–11 December 2020; pp. 1–8. [Google Scholar]
  20. Lin, X.; Xiong, G.; Gou, G.; Li, Z.; Shi, J.; Yu, J. ET-BERT: A Contextualized Datagram Representation with Pre-Training Transformers for Encrypted Traffic Classification. In Proceedings of the WWW’22: The ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 633–642. [Google Scholar]
  21. Zhao, R.; Zhan, M.; Deng, X.; Wang, Y.; Gui, G.; Xue, Z. Yet Another Traffic Classifier: A Masked Autoencoder Based Traffic Transformer with Multi-Level Flow Representation. Proc. AAAI Conf. Artif. Intell. 2023, 37, 5420–5427. [Google Scholar] [CrossRef]
  22. Chen, X.; Ding, M.; Wang, X.; Xin, Y.; Mo, S.; Wang, Y.; Han, S.; Luo, P.; Zeng, G.; Wang, J. Context Autoencoder for Self-Supervised Representation Learning. Int. J. Comput. Vis. 2024, 132, 208–223. [Google Scholar] [CrossRef]
  23. Zhao, Y.; Medvidovic, N. A Microservice Architecture for Online Mobile App Optimization. In Proceedings of the 2019 IEEE/ACM 6th International Conference on Mobile Software Engineering and Systems (MOBILESoft), Montreal, QC, Canada, 25 May 2019; pp. 45–49. [Google Scholar]
  24. Aceto, G.; Ciuonzo, D.; Montieri, A.; Pescapè, A. Mobile Encrypted Traffic Classification Using Deep Learning: Experimental Evaluation, Lessons Learned, and Challenges. IEEE Trans. Netw. Serv. Manag. 2019, 16, 445–458. [Google Scholar] [CrossRef]
  25. Van Ede, T.; Bortolameotti, R.; Continella, A.; Ren, J.; Dubois, D.; Lindorfer, M.; Choffnes, D.; Steen, M.; Peter, A. Flowprint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic. Proc. NDSS Symp. 2020, 27, 1909020. [Google Scholar]
  26. Shen, M.; Liu, Y.; Zhu, L.; Xu, K.; Du, X.; Guizani, N. Optimizing Feature Selection for Efficient Encrypted Traffic Classification: A Systematic Approach. IEEE Netw. 2020, 34, 20–27. [Google Scholar] [CrossRef]
  27. Taylor, V.F.; Spolaor, R.; Conti, M.; Martinovic, I. Robust Smartphone App Identification via Encrypted Network Traffic Analysis. IEEE Trans. Inf. Forensics Secur. 2017, 13, 63–78. [Google Scholar] [CrossRef]
  28. Hayes, J.; Danezis, G. k-Fingerprinting: A Robust Scalable Website Fingerprinting Technique. In Proceedings of the 25th USENIX Security Symposium, Austin, TX, USA, 10–12 August 2016; pp. 1357–1374. [Google Scholar]
  29. Taylor, V.F.; Spolaor, R.; Conti, M.; Berthome, P. Appscanner: Automatic Fingerprinting of Smartphone Apps from Encrypted Network Traffic. In Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P), Saarbruecken, Germany, 21–24 March 2016; pp. 439–454. [Google Scholar]
  30. Al-Naami, K.; Chandra, S.; Mustafa, A.; Khan, L.; Lin, Z.; Hamlen, K.; Thuraisingham, B. Adaptive Encrypted Traffic Fingerprinting with Bi-Directional Dependence. In Proceedings of the 2016 Annual Computer Security Applications Conference, Los Angeles, CA, USA, 5–8 December 2016; pp. 177–188. [Google Scholar]
  31. He, Y.; Li, W. Image-Based Encrypted Traffic Classification with Convolutional Neural Networks. In Proceedings of the 2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC), Hong Kong, China, 27–30 July 2020; pp. 271–278. [Google Scholar]
  32. Okonkwo, Z.; Foo, E.; Li, Q.; Hou, Z. A CNN Based Encrypted Network Traffic Classifier. In Proceedings of the 2022 Australasian Computer Science Week, Brisbane, Australia, 14–18 February 2022; pp. 74–83. [Google Scholar]
  33. Sengupta, S.; Ganguly, N.; De, P.; Chakraborty, S. Exploiting Diversity in Android TLS Implementations for Mobile App Traffic Classification. In Proceedings of the WWW ’19: The World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 1657–1668. [Google Scholar]
  34. Canard, S.; Diop, A.; Kheir, N.; Paindavoine, M.; Sabt, M. BlindIDS: Market-Compliant and Privacy-Friendly Intrusion Detection System over Encrypted Traffic. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017; pp. 561–574. [Google Scholar]
  35. Wei, H.; Xie, R.; Cheng, H.; Feng, L.; Li, Y. Mitigating Neural Network Overconfidence with Logit Normalization. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 23631–23644. [Google Scholar]
  36. Zhao, S.; Chen, S.; Wang, F.; Wei, Z.; Zhong, J.; Liang, J. A Large-Scale Mobile Traffic Dataset for Mobile Application Identification. Comput. J. 2024, 67, 1501–1513. [Google Scholar] [CrossRef]
  37. Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware Traffic Classification Using Convolutional Neural Network for Representation Learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017; pp. 712–717. [Google Scholar]
  38. Gil, G.D.; Lashkari, A.H.; Mamun, M.; Ghorbani, A. Characterization of Encrypted and VPN Traffic Using Time-Related Features. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP), Rome, Italy, 19–21 February 2016; pp. 407–414. [Google Scholar]
  39. Lashkari, A.H.; Gil, G.D.; Mamun, M.; Ghorbani, A. Characterization of Tor Traffic Using Time Based Features. In Proceedings of the 3rd International Conference on Information Systems Security and Privacy, Lisbon, Portugal, 22–24 February 2017; pp. 253–262. [Google Scholar]
  40. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941. [Google Scholar] [CrossRef]
  41. Xie, G.; Li, Q.; Jiang, Y.; Dai, T.; Shen, G.; Li, R.; Sinnott, R.; Xia, S. SAM: Self-Attention Based Deep Learning Method for Online Traffic Classification. In Proceedings of the Workshop on Network Meets AI & ML, Virtual, 10–14 August 2020; pp. 14–20. [Google Scholar]
  42. Shen, M.; Zhang, J.; Zhu, L.; Xu, K.; Du, X. Accurate Decentralized Application Identification via Encrypted Traffic Analysis Using Graph Neural Networks. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2367–2380. [Google Scholar] [CrossRef]
  43. Zhang, H.; Yu, L.; Xiao, X.; Li, Q.; Mercaldo, F.; Luo, X.; Liu, Q. TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-Grained Encrypted Traffic Classification. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2066–2075. [Google Scholar]
  44. Panchenko, A.; Lanze, F.; Pennekamp, J.; Wehrle, K.; Engel, T. Website Fingerprinting at Internet Scale. Proc. NDSS 2016, 1, 23477. [Google Scholar]
  45. Wang, X.; Chen, S.; Su, J. App-Net: A Hybrid Neural Network for Encrypted Mobile Traffic Classification. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; p. 23477. [Google Scholar]
  46. Schuster, R.; Shmatikov, V.; Tromer, E. Beauty and the Burst: Remote Identification of Encrypted Video Streams. In Proceedings of the 26th USENIX Security Symposium, Vancouver, BC, Canada, 16–18 August 2017; pp. 1357–1374. [Google Scholar]
  47. Yang, L.; Li, Y.; Song, H.; Lv, Y.; Liu, J. Flow-Based Encrypted Network Traffic Classification with Graph Neural Networks. IEEE Trans. Netw. Serv. Manag. 2023, 20, 1504–1516. [Google Scholar] [CrossRef]
Figure 1. Threat models of mobile network traffic classification.
Figure 2. Schematic diagram of the composition of encrypted traffic packets.
Figure 3. MS-PreTE architecture overview.
Figure 4. Pre-training process.
Figure 5. Comparison of loss and accuracy curves on the test set across seven experimental settings.
Figure 6. Confusion matrix for the NUDT-Mobile-Traffic dataset.
Figure 7. Comparison of loss and accuracy curves across seven experimental settings.
Figure 8. Performance comparison with three other pre-training methods on the ISCX-Tor2016 dataset.
Table 1. Statistical Information of the Datasets.

Task | Dataset | Labels | Flows | Packets
ETCGA | Cross-Platform (Android) [25] | 215 | 0.68 M | 3.4 M
ETCGA | Cross-Platform (iOS) [25] | 196 | 0.50 M | 2.5 M
ETCGA | NUDT-Mobile-Traffic [36] | 350 | 2.88 M | 14.4 M
ETCMA | USTC-TFC-2016 [37] | 20 | 4.8 K | 24 K
ETCV | ISCX-VPN-2016 [38] | 7 | 2.8 K | 14 K
ETCT | ISCX-Tor-2016 [39] | 8 | 3.6 K | 18 K
TCIoT | CIC-IoT-2023 [40] | 33 | 3.5 K | 17.5 K
Table 2. Comparison Results on ETCGA Task.

Method | Cross-Platform (iOS) AC / PR / RC / F1 | Cross-Platform (Android) AC / PR / RC / F1 | NUDT-Mobile-Traffic AC / PR / RC / F1
AppScanner | 0.4219 / 0.2991 / 0.2628 / 0.2638 | 0.4365 / 0.4847 / 0.4701 / 0.4767 | 0.6426 / 0.6446 / 0.6118 / 0.6202
CUMUL | 0.3077 / 0.1774 / 0.1810 / 0.1675 | 0.3098 / 0.3783 / 0.3818 / 0.3307 | 0.5780 / 0.5528 / 0.5437 / 0.5354
BIND | 0.5281 / 0.4007 / 0.3712 / 0.3738 | 0.6076 / 0.6040 / 0.5495 / 0.5535 | 0.7587 / 0.7650 / 0.7365 / 0.7480
Beauty | 0.1878 / 0.0606 / 0.0601 / 0.0509 | 0.2794 / 0.2799 / 0.2172 / 0.1834 | 0.1806 / 0.1827 / 0.1233 / 0.1121
FS-Net | 0.4728 / 0.3710 / 0.3369 / 0.3359 | 0.4763 / 0.4635 / 0.4196 / 0.4291 | 0.6276 / 0.6154 / 0.5821 / 0.5847
AppNet | 0.3971 / 0.3399 / 0.2774 / 0.2855 | 0.4050 / 0.3600 / 0.3350 / 0.3263 | 0.6265 / 0.6457 / 0.6038 / 0.6165
DF | 0.3390 / 0.2065 / 0.1920 / 0.1856 | 0.3337 / 0.2477 / 0.2682 / 0.2671 | 0.5633 / 0.6024 / 0.5260 / 0.5402
SAM | 0.9572 / 0.9805 / 0.9384 / 0.9562 | 0.9048 / 0.8899 / 0.9129 / 0.8999 | 0.1133 / 0.1895 / 0.1085 / 0.0969
GraphDApp | 0.2541 / 0.1925 / 0.1530 / 0.1576 | 0.2762 / 0.2113 / 0.1871 / 0.1781 | 0.5815 / 0.5897 / 0.5445 / 0.5567
TFE-GNN | 0.3472 / 0.2657 / 0.2517 / 0.2307 | 0.3269 / 0.3027 / 0.2859 / 0.2785 | 0.0833 / 0.0541 / 0.0831 / 0.0444
FB-GNN | 0.3469 / 0.2342 / 0.2461 / 0.2400 | 0.3373 / 0.2977 / 0.2761 / 0.2600 | 0.0815 / 0.0573 / 0.0791 / 0.0431
PERT | 0.8380 / 0.8440 / 0.7615 / 0.7879 | 0.7323 / 0.7279 / 0.6909 / 0.6928 | 0.7863 / 0.7980 / 0.7588 / 0.7697
ET-BERT | 0.9663 / 0.9669 / 0.9250 / 0.9370 | 0.9221 / 0.8874 / 0.7908 / 0.7994 | 0.2222 / 0.1003 / 0.0247 / 0.0366
YaTC | 0.9736 / 0.9753 / 0.9736 / 0.9728 | 0.9707 / 0.9738 / 0.9707 / 0.9709 | 0.8502 / 0.8600 / 0.8502 / 0.8523
MS-PreTE | 0.9942 / 0.9957 / 0.9942 / 0.9934 | 0.9881 / 0.9877 / 0.9880 / 0.9861 | 0.8760 / 0.8780 / 0.8760 / 0.8770
Bold indicates the best result.
Table 3. Comparison Results on ETCV, TCIoT, and ETCMA Tasks.

Method | ISCX-VPN-2016 AC / PR / RC / F1 | CIC-IoT-2023 AC / PR / RC / F1 | USTC-TFC-2016 AC / PR / RC / F1
AppScanner | 0.7127 / 0.7425 / 0.6473 / 0.6470 | 0.5365 / 0.5847 / 0.4701 / 0.4767 | 0.8750 / 0.8779 / 0.8333 / 0.8385
CUMUL | 0.5769 / 0.4524 / 0.5157 / 0.4670 | 0.5098 / 0.4783 / 0.4818 / 0.4307 | 0.7162 / 0.6606 / 0.6541 / 0.6422
BIND | 0.7341 / 0.7209 / 0.6204 / 0.6569 | 0.6076 / 0.6040 / 0.5495 / 0.5535 | 0.8835 / 0.8791 / 0.8403 / 0.8468
Beauty | 0.3951 / 0.3434 / 0.2046 / 0.1959 | 0.2794 / 0.2799 / 0.2172 / 0.1834 | 0.6097 / 0.5352 / 0.3899 / 0.3815
FS-Net | 0.6829 / 0.6013 / 0.6094 / 0.6024 | 0.7763 / 0.5635 / 0.4196 / 0.4291 | 0.7863 / 0.5983 / 0.5968 / 0.5968
AppNet | 0.5822 / 0.4918 / 0.4936 / 0.4850 | 0.5050 / 0.4600 / 0.4350 / 0.4263 | 0.8878 / 0.8160 / 0.8154 / 0.8120
DF | 0.6230 / 0.5442 / 0.4940 / 0.5061 | 0.5337 / 0.5477 / 0.4682 / 0.4671 | 0.7055 / 0.5879 / 0.5537 / 0.5415
SAM | 0.8432 / 0.8653 / 0.7913 / 0.8190 | 0.7048 / 0.3899 / 0.3129 / 0.2999 | 0.8610 / 0.8973 / 0.8729 / 0.8721
GraphDApp | 0.5305 / 0.4841 / 0.4460 / 0.4587 | 0.4762 / 0.4113 / 0.3871 / 0.3781 | 0.8654 / 0.9123 / 0.8689 / 0.8726
TFE-GNN | 0.6900 / 0.6600 / 0.6001 / 0.6080 | 0.3269 / 0.3027 / 0.2859 / 0.2785 | 0.9167 / 0.8264 / 0.8250 / 0.8245
FB-GNN | 0.6825 / 0.6551 / 0.6003 / 0.6124 | 0.5041 / 0.4816 / 0.4553 / 0.4627 | 0.5820 / 0.5602 / 0.5410 / 0.5465
PERT | 0.7265 / 0.6474 / 0.6227 / 0.6249 | 0.7323 / 0.7279 / 0.6909 / 0.6928 | 0.9748 / 0.9798 / 0.9754 / 0.9746
ET-BERT | 0.9025 / 0.8877 / 0.8818 / 0.8806 | 0.9221 / 0.8874 / 0.7908 / 0.7994 | 0.8984 / 0.9243 / 0.9009 / 0.9054
YaTC | 0.9650 / 0.9653 / 0.9650 / 0.9647 | 0.9610 / 0.9630 / 0.9610 / 0.9594 | 0.9693 / 0.9707 / 0.9693 / 0.9692
MS-PreTE | 0.9685 / 0.9689 / 0.9685 / 0.9680 | 0.9587 / 0.9598 / 0.9587 / 0.9586 | 0.9756 / 0.9756 / 0.9756 / 0.9755
Bold indicates the best result.
Table 4. Ablation Study on ETCGA Tasks.

Method | Cross-Platform (iOS) AC / F1 | Cross-Platform (Android) AC / F1 | NUDT-Mobile-Traffic AC / F1
MS-PreTE | 0.9942 / 0.9934 | 0.9881 / 0.9861 | 0.8760 / 0.8770
Single Encoder | 0.9773 / 0.9782 | 0.9755 / 0.9753 | 0.8463 / 0.8462
w/o Focal-Attention | 0.9854 / 0.9829 | 0.9789 / 0.9784 | 0.8583 / 0.8582
w/o FRM | 0.9891 / 0.9892 | 0.9817 / 0.9812 | 0.8681 / 0.8680
w/o Pre-train | 0.9539 / 0.9510 | 0.9562 / 0.9549 | 0.8352 / 0.8315
Table 5. Ablation Study on ETCV, TCIoT, and ETCMA Tasks.

Method | Cross-Platform (iOS) AC / F1 | Cross-Platform (Android) AC / F1 | NUDT-Mobile-Traffic AC / F1
MS-PreTE | 0.9942 / 0.9934 | 0.9881 / 0.9861 | 0.8760 / 0.8770
Single Encoder | 0.9773 / 0.9782 | 0.9755 / 0.9753 | 0.8463 / 0.8462
w/o Focal-Attention | 0.9854 / 0.9829 | 0.9789 / 0.9784 | 0.8583 / 0.8582
w/o FRM | 0.9891 / 0.9892 | 0.9817 / 0.9812 | 0.8681 / 0.8680
w/o Pre-train | 0.9539 / 0.9510 | 0.9562 / 0.9549 | 0.8352 / 0.8315
Table 6. Model FLOPs and Parameters.

Model | FLOPs (M) | Parameters (M)
GraphDApp | 3.6 × 10² | 1.4 × 10¹
TFE-GNN | 2.2 × 10³ | 4.4 × 10¹
PERT | 2.0 × 10⁵ | 3.7 × 10²
ET-BERT | 1.2 × 10⁴ | 1.0 × 10²
YaTC | 9.1 × 10² | 3.7 × 10⁰
MS-PreTE | 9.1 × 10² | 3.7 × 10⁰
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
