MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT

Tang, Jian; Huang, Zhao; Li, Chunqiang

doi:10.3390/fi17080323

Open AccessArticle

MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT

by

Jian Tang

^1,*,

Zhao Huang

¹ and

Chunqiang Li

²

¹

China International Water & Electric Corp., Beijing 101116, China

²

Computer School, Beijing Information Science & Technology University, Beijing 100101, China

^*

Author to whom correspondence should be addressed.

Future Internet 2025, 17(8), 323; https://doi.org/10.3390/fi17080323

Submission received: 30 May 2025 / Revised: 27 June 2025 / Accepted: 28 June 2025 / Published: 23 July 2025

Download

Browse Figures

Versions Notes

Abstract

The rising frequency of network intrusions has significantly impacted critical infrastructures, leading to an increased focus on the detection of malicious network traffic in recent years. However, traditional port-based and classical machine learning-based malicious network traffic detection methods suffer from a dependence on expert experience and limited generalizability. In this paper, we propose a malicious traffic detection method based on an efficient federated learning framework of Bidirectional Encoder Representations from Transformers (BERT), called MT-FBERT. It offers two major advantages over most existing approaches. First, MT-FBERT pretrains BERT using two pre-training tasks along with an overall pre-training loss on large-scale unlabeled network traffic, allowing the model to automatically learn generalized traffic representations, which do not require human experience to extract the behavior features or label the malicious samples. Second, MT-FBERT finetunes BERT for malicious network traffic detection through an efficient federated learning framework, which both protects the data privacy of critical infrastructures and reduces resource consumption by dynamically identifying and updating only the most significant neurons in the global model. Evaluation experiments on public datasets demonstrated that MT-FBERT outperforms state-of-the-art baselines in malicious network traffic detection.

Keywords:

malicious traffic detection; deep learning; federated learning; pre-training; fine-tuning; BERT

1. Introduction

The China International Water and Electric Corporation is an enterprise focused on water conservancy, hydropower engineering, and infrastructure construction. As the world’s largest hydropower engineering contractor, the corporation engages in the construction of large-scale hydropower stations, reservoirs, irrigation systems, and other projects internationally, with operations spanning Asia, Africa, South America, and beyond. The company manages numerous hydropower station management systems and holds extensive operational data from these facilities. Given that hydropower facilities represent critical infrastructure for host countries, ensuring the cybersecurity of critical information infrastructure sector has become essential for safeguarding national cyberspace security, particularly in the context of today’s complex and volatile international landscape [1,2,3,4,5,6]. The types, scale, and complexity of cyber threats targeting critical infrastructure significantly differ from traditional IT-based systems [7]: (1) Attacks on critical infrastructure are more intricate than conventional cyber threats [8,9,10,11,12,13]. (2) Overseas power stations present a specific vulnerability to exploitation [14,15,16]. (3) Cross-border data transfers require exceptionally high security standards [17,18,19]. Therefore, the establishment of robust cybersecurity defense mechanisms for critical infrastructure is an urgent imperative.

Malicious traffic detection is a vital method for identifying network attacks. Developing effective malicious traffic detection methods has become a critical task in maintaining the normal operations of critical infrastructure’s network. Malicious network traffic detection methods can be classified into three categories. Firstly, rule-based detection methods identify malicious traffic according to pre-defined traffic signatures (e.g., patterns and keywords in the data packets from the payload). These methods achieve high detection accuracy for the known attacks but fail to detect the unknown attacks and attack variants. Secondly, traditional machine learning-based detection methods extract statistical features and employ classical machine learning algorithms to detect malicious traffic. These approaches can effectively identify variant and unknown malicious traffic, but they highly rely on expert experience for feature selection. Thirdly, deep learning-based detection methods automatically learn complex behavioral features using deep neural networks. However, most of these methods require a vast amount of labelled training data, while large-scale and high-quality labeled datasets are scarce.

Recently, pre-training-based methods have shown promise in addressing the problem of limited labeled data. These methods learn general data representations from large, unlabeled datasets. Such data representations can be transferred to the specific downstream tasks by fine-tuning on limited labeled data. Pre-training-based methods have demonstrated outstanding performance in natural language processing (NLP) and Computer Vision (CV) tasks. However, their application in malicious network traffic detection by applying pre-training techniques remains limited. In addition, most existing pre-training models are trained and fine-tuned in centralized learning techniques. The fine-tuning datasets consist of labeled samples, but collecting high-quality labeled malicious network traffic datasets centrally raises privacy concerns.

Federated learning (FL) is an emerging deep learning framework. It facilitates data sharing of multiple organizations to collaborate without uploading data to a central server, thereby addressing data silos and enhancing data privacy. Whereas, training complex models within the FL framework incurs high communication and computational costs, as client devices applied in one organization often possess limited hardware resources and network bandwidth, especially when dealing with pre-trained language models.

In this paper, we focus on the pre-training-based method for malicious traffic detection in an FL framework by addressing the following questions: (1) Can we pre-train the BERT and fine-tune it for malicious network traffic detection across multiple organizations using an FL framework? (2) Can we achieve an excellent malicious traffic detection performance while using cost-effective model training in resource-constrained clients in an FL framework?

In order to address these questions, we propose a malicious traffic detection model based on an efficient federated learning framework of BERT, called MT-FBERT. MT-FBERT firstly pretrains BERT using self-supervised learning on large-scale, unlabeled network traffic data to learn generic traffic representations. Specifically, it employs two pre-training tasks separately, and the overall pre-training task loss. Then, MT-FBERT finetunes the pre-trained model using FL by creating small local models for each client. In order to consume fewer computational resources and transmit fewer weights, only a subset of crucial neurons in the global model is selected and updated.

The main contributions of this paper are as follows:

We propose a novel pre-training model for malicious network traffic detection, called MT-FBERT. It leverages BERT to take advantage of the unlabeled traffic data to learn generic traffic patterns without relying on expert experience. And then the fine-tuning for malicious traffic detection in the FL framework can assure both the data privacy and the detection accuracy for multiple organizations.
We introduce an efficient malicious traffic fine-tuning mechanism with FL for MT-FBERT on labeled malicious traffic data. It selects and updates the important neurons of the global model in each client to save computational resources and transmission bandwidth.
We conduct experiments on several public datasets, which demonstrate the excellent malicious traffic detection performance of MT-FBERT by comparing it to the state-of-the-art baselines with multiple evaluation metrics. Under the conditions of limited labeled samples, distribution shifts, or constrained computational resources, MT-FBERT consistently performs with stability and efficiency.

2. Related Works

2.1. Malicious Traffic Detection

Malicious traffic detection methods can be categorized into three main approaches: rule-based, machine learning-based, and deep learning-based methods.

In the early stage, researchers mainly used rule-based detection methods, which involved manually constructing a rule base using expert knowledge. The rule base is primarily constructed from textual information, including traffic ports, Uniform Resource Locators (URLs), and traffic payload data. McPherson et al. [20] distinguished various application types by checking TCP/UDP port numbers of the traffic transport layer, such as port 443 for HTTPS traffic. Korczy’nski et al. [21] extracted plaintext information from the packet payload of TLS traffic and employed a first-order Markov chain to identify the application and anomalous communications. Ning et al. [22] proposed Deep Packet Inspection (DPI) rules to enhance privacy and security during the traffic detection process in a network middlebox. Rule-based methods offer the advantages of rapid identification and high accuracy. However, the development of these rules incurs significant costs, and the rule base must be continuously updated to maintain rule flexibility and effectiveness.

To effectively analyze complex traffic, Machine Learning (ML) algorithms are utilized to examine the high-dimensional statistical features of the traffic. Machine learning-based detection methods manually design features through expert prior knowledge. Panchenko et al. [23] selected 104 optimal statistical features, which were subsequently used as input for a Support Vector Machine (SVM) to identify website traffic. Shekhawat et al. [24] extracted 38 features, including fingerprint information, cipher suite details, and connection characteristics. These features were analyzed using SVM, Random Forest (RF), and Extreme Gradient Boosting (XGBoost), resulting in high detection accuracy. Cai et al. [25] designed a hybrid feature selection algorithm to remove redundant statistical features and applied the XGBoost algorithm to detect anonymous network traffic. ML-based methods are capable of analyzing complex traffic. However, the process of constructing statistical features heavily relies on expert knowledge and requires the selection of optimal features tailored to specific scenarios. These limitations often lead to issues such as overfitting and poor generalization ability.

Deep learning methods utilize self-learned features, avoiding manual engineering, thereby addressing the limitations inherent in machine learning-based detection approaches. Wang et al. [26] converted traffic into grayscale images and then used a convolutional neural network (CNN) to identify traffic images for encrypted traffic classification. Zhang et al. [27] proposed an enhanced multi-channel CNN classifier that utilizes multiple channels to enrich feature information and enhance analysis efficiency. Lotfollahi et al. [28] integrated a Stacked Autoencoder (SAE) model with a CNN to develop a “Deep Packet” framework. In this framework, the SAE model distinguishes between normal and abnormal traffic, while the CNN performs fine-grained classification of applications. Lin et al. [29] combined a CNN and a Recurrent Neural Network (RNN) to learn the temporal characteristics from time-related packets. The CNN captures spatial features of the traffic, while the RNN extracts temporal features. However, deep learning-based detection methods require large volumes of labeled traffic data, and collecting and manually annotating sufficient malicious traffic samples remains a significant challenge. When the available labeled traffic data is limited, the performance of the detection model is significantly compromised.

The pre-training model, based on transformers [30,31,32], is promising in addressing the problem of limited labeled data. It consists of two stages: pre-training and fine-tuning. During the pre-training stage, unlabeled data is used to acquire general knowledge through self-supervised learning. In the fine-tuning stage, labeled data is employed to learn task-specific knowledge using supervised learning. In natural language processing tasks, BERT [31] is the most widely used to implement pre-trained models by the cloze task and next sentence prediction. The widespread applications of BERT demonstrate its capability to leverage unlabeled natural language text to learn robust feature representations on limited labeled natural language text [33,34]. In the traffic analysis tasks, Wang et al. [35] utilize the unidirectional Mamba architecture for traffic sequence modeling and develop a comprehensive representation scheme for traffic data. Zhao et al. [36] adopted a Masked Autoencoder (MAE) based traffic transformer to learn latent representations through the reconstruction task. Hang et al. [37] introduced a Masked Patch Model to capture unbiased representations from the traffic with varying lengths and patterns. However, most existing pre-training models are trained and fine-tuned using centralized learning techniques, where the fine-tuning datasets consist of labeled samples. Centralized collection of high-quality labeled malicious network traffic datasets raises significant privacy concerns. Our proposed MT-FBERT can tackle these challenges effectively.

2.2. Federated Learning

FL was first introduced by Google in 2016 [38]. It is an emerging deep learning paradigm that facilitates data sharing without requiring data to be uploaded to a central server [39], thereby addressing data silos and enhancing data privacy [40].

In recent years, FL has been applied to the task of network traffic classification. Hyunsu et al. [41] proposed FLIC, a federated-learning traffic classification protocol that can distinguish new applications in real-time without compromising privacy. He et al. [42] proposed an Edge Device Identification (EDI) method based on FL. This method has a faster training speed and shorter training time than the centralized learning method. Majeed et al. [43] presented a cross-silo horizontal FL scheme to classify the traffic based on supervised feature-based deep learning and proved that the federated model is also comparable to the centralized training approach. Guo et al. [44] proposed a federated approach for privacy-preserving network traffic classification in heterogeneous environments. This approach incorporates a client selection algorithm designed to proactively and intelligently identify clients with low skewness, enabling their participation in model aggregation within the federated learning framework. The aforementioned methods, which combine FL with supervised learning, depend on labeled samples for network traffic classification. In addition, these FL training methods focus on the smaller datasets and the use of relatively small network models for training.

Recently, the technology integrating FL with pre-trained models has been proposed [45]. On one hand, this technology can ensure user privacy and effectively utilize more training data. On the other hand, pre-trained learning can alleviate the dependence on labeled data. Yet, current FL training methods typically require substantial communication and computational resources [46,47]. This renders training complex models within the FL framework exceptionally challenging, particularly for pre-trained models that have achieved remarkable progress in natural language processing tasks, such as GPT [32] and BERT [31].

To enable more complex models to adapt to the limited hardware resources and communication bandwidth of federated learning clients, FedSplitBERT [46] proposes a novel framework that reduces communication costs by partitioning BERT encoder layers into local and global components. Parameters of the local part are trained exclusively by local clients, while global part parameters are optimized through gradient aggregation across multiple clients. FedPETuning [47] investigates parameter-efficient tuning methods to alleviate the communication overhead of pre-trained models in federated learning, which keep most pre-trained model parameters frozen and only fine-tune additional lightweight parameters or a subset of parameters for downstream tasks. These methods enhance efficiency by designing the parameters for model updates in federated learning. However, these approaches primarily focus on downstream task fine-tuning and show limited adaptability to pre-training tasks [48]. Additionally, their effectiveness in the malicious network traffic detection task remains unapplied and uninvestigated. It is critical to incorporate network packet data analysis components for extracting features from network traffic in malicious network traffic detection.

Given the uniqueness of network packet data analysis and the requirement for high accuracy via pre-training and fine-tuning, sharing actual network traffic data for training purposes may raise privacy concerns. We propose MT-FBERT for malicious network traffic detection, an efficient, privacy-preserving architecture specifically designed to address these considerations.

3. MT-FBERT

3.1. Framework

The proposed MT-FBERT contains two stages: pre-training for learning generic traffic representations with large-scale unlabeled data, and privacy-preserving fine-tuning for adjusting the pre-trained model for malicious traffic detection. The overall framework of MT-EFBERT is illustrated in Figure 1.

In the pre-training stage, the unlabeled traffic flows are transformed into vectors by an encoding layer, which contains token semantic encoding, word position encoding, and word segment encoding. Subsequently, these encoded vectors are then passed through a representation layer, which consists of a sequence of transformer layers. Each transformer layer (T-Layer) comprises a multi-head attention mechanism and a feedforward neural network. The attention mechanism establishes connections between each token to enable its representation by combining each token with other tokens. Finally, the token vectors are input into the task layer. The tasks in the pre-training stage include Masked Burst Modeling (MBM) and the Next Burst Prediction (NBP) tasks. The model computes gradients to optimize its parameters based on the losses received from these two tasks.

In the malicious traffic detection stage, the model is initialized with the parameters obtained from the pre-trained stage, and then the federated learning method is adopted to collaboratively train the detection model among multiple organizations on labeled malicious traffic data. In order to save the communication cost of federated learning, there are two key modules: neuron compression and knowledge distillation, which are designed to compress the parameters of the model. The neuron compression performs layer compression in each transformer layer. The knowledge distillation conducts layer-level and neuron-level distillation before and during fine-tuning, respectively.

3.2. The Pre-Training Stage

3.2.1. Traffic Data Pre-Processing

Network traffic differs greatly from natural language and images in that it does not contain human-understandable content and explicit semantic units. In order to effectively leverage the pretraining technique for traffic representation, traffic is transformed into a pattern-preserved token unit for pre-training. Figure 2 illustrates an example of the traffic data pre-processing.

The traffic data is initially segmented into multiple flows, with each flow subsequently divided into multiple bursts. A flow is defined by a 5-tuple, comprising the source and destination IP addresses, source and destination ports, and the protocol. A burst is defined as a sequence of consecutive packets originating from either the request or the response within a single flow. Each packet is represented as a string of hexadecimal numbers. These hexadecimal strings are encoded in bi-grams, where each unit consists of two adjacent bytes, and each byte is represented as a 4-digit hexadecimal string. The Byte Pair Encoding (BPE) algorithm is subsequently applied for token representation, where each token unit is a 4-digit hexadecimal string with a value ranging from 0 to 65,535. The vocabulary size is specified as a maximum of 65,536. In addition, special tokens such as [CLS], [SEP], [PAD], and [MASK] are added for training tasks. [CLS] is added to the beginning of every sequence as the first token. [SEP] is used to separate sequences. [PAD] pads sequences to the maximum length to satisfy the requirement. [MASK] is used for masked language modeling tasks during pre-training to learn the context of the traffic.

Each token is represented by three types of embeddings: token embedding, position embedding, and segment embedding. Token embedding refers to the representation of tokens learned from a sequence of 4-digit hexadecimal string token units, as illustrated in Figure 2. The final embedding dimension of the token is set to 768. Position embedding enables the model to learn temporal relationships between tokens via relative positioning. We assign a 768-dimensional vector to each token to encode its sequential position information. As shown in Figure 2, a burst is equally divided into two segments, which are distinguished by the [SEP] token. The segment embedding is to indicate whether a token belongs to the first segment or the second segment. And the segment embedding dimension is set to 768. The final token representation is constructed by summing these three embeddings as shown in Figure 3. Initially, the embeddings are randomly initialized, where the embedding dimension is 768, and subsequently fine-tuned through multiple iterations of transformer encoding.

3.2.2. The Pre-Training Tasks

Two pre-training tasks are employed: Masked Burst Modeling (MBM) and Next Burst Prediction (NBP). These tasks are designed to capture the contextual relationships between traffic tokens.

The MBM task is similar to the masked language modeling and Masked Image Modeling utilized by BERT [31]. In the MBM task, the traffic token sequence of the input is masked, and the model is required to recover the masked tokens based on the contextual tokens. As shown in Figure 4, each original burst token sequence is first randomly masked to obtain the masked input. During the masking, each token in the original sequence is randomly chosen with 15% probability. As the chosen token from each sequence, we replace it with [MASK] at 80% chance, or choose a random token to replace it at 10% chance, or leave it unchanged at 10% chance, respectively. The model then encodes the masked token sequence with the encoding layer, which consists of three embedding layers, and the vector representation of tokens is obtained by summing up these embeddings. The vectors of tokens are further input into the representation layer, which includes several encoder layers of a transformer to learn the intermediate representations of tokens by using the contextual information. The intermediate representations are processed through a mapping network to predict the masked tokens based on the contextual information; meanwhile, the prediction probabilities of the tokens can be generated. Assume the sequence as

S = {\{{t o k e n}_{i}\}}_{i = 1}^{n}

, we mask

k

tokens and

{M A S K}_{i}

represents the masked token at the

i_{t h}

position in the sequence

S

. The contextual information of the

i_{t h}

position in the sequence

S

is set as

{C o n t e x t}_{i} = \{{t o k e n}_{1}, \dots, {t o k e n}_{i - 1}, {t o k e n}_{i + 1}, \dots, {t o k e n}_{n}\}

. The loss function of the MBM task can be defined as

{l o s s}_{M B M} = - \sum_{i = 1}^{k} l o g (P ({M A S K}_{i} = {t o k e n}_{i} |{C o n t e x t}_{i})),

(1)

The NBP task is similar to the Next Sentence Prediction utilized by BERT. A burst is a sequence of consecutive packets originating from the same direction. We learn the correlation between packets in a burst based on the NBP task. In the NBP task, a burst is equally divided into two segments, which can be denoted as

{sub - burst}_{a}

and

{sub - burst}_{b}

. For instance, as shown in Figure 2, “Burst 1” is represented by six traffic tokens, with the first three tokens comprising

{sub - burst}_{a}

and the remaining three tokens forming

{sub - burst}_{b}

. The NBP task is designed to predict whether two given segments belong to the same burst while simultaneously learning traffic representations by modeling packet-level dependencies within bursts. Specifically, we select

{sub - burst}_{a}

and

{sub - burst}_{b}

as input, and concatenate the two segments end to end, with [SEP] connecting them in the middle. During 50% of the time, the chosen

{sub - burst}_{b}

is the true next segment of

{sub - burst}_{a}

. During the remaining 50% of the time,

{sub - burst}_{b}

is randomly selected as long as it is not the next segment of

{sub - burst}_{a}

. Formally, the NBP task constitutes a binary classification problem, where the goal is to determine whether two segments belong to the same burst. Assume there are

n

bursts, denoting as

B = {\{({sub - burst}_{a}^{j}, {sub - burst}_{b}^{j})\}}_{j = 1}^{n}

. The loss function of the NBP task can be defined as

{l o s s}_{N B P} = - \sum_{j = 1}^{n} l o g (P (y_{j} |({sub - burst}_{a}^{j}, {sub - burst}_{b}^{j}))),

(2)

where

y_{j}

equals 0 if

{sub - burs t}_{b}^{j}

is the next segment of

{sub - burst}_{a}^{j}

, and 1 otherwise.

Overall, the loss function of the pre-training task combines the above two loss terms, which is defined as

{l o s s = {l o s s}_{M B M} + l o s s}_{N B P},

(3)

3.3. The Malicious Traffic Detection Stage

In the malicious traffic detection stage, the model is initialized with the parameters of the pre-trained model, and then the FL method is adopted to collaboratively train the detection model among multiple organizations on labeled malicious traffic data, which can effectively protect the data privacy of the organizations. Moreover, given the computational and communication overhead of pre-training models in federated learning, we introduce an optimized framework featuring neuron compression and knowledge distillation.

The neuron compression module utilizes magnitude-based pruning [49] to selectively retain a subset of neurons within the transformer layers. Each transformer layer comprises two primary sub-layers: a multi-head attention (MHA) mechanism and a feed-forward network (FFN). The FFN contains more parameters than the MHA [50]. We implement the neuron compression by removing neurons with low weights in the FFN. The FFN comprises three sequential components: (1) an input fully-connected layer, (2) a ReLU activation layer, and (3) an output fully-connected layer. The output of the FFN can be denoted as

FFN (x) = \max (0, x W_{1} + b_{1}) W_{2} + b_{2},

(4)

where

x

is the input of FFN and its dimensionality is

H

. Assume the fully connected layer in FFN has dimensionality

d_{f}

.

W_{1} \in R^{H \times d_{f}}

and

W_{2} \in R^{d_{f} \times H}

are the weight matrices of the two fully connected layers.

b_{1} \in R^{d_{f}}

and

b_{2} \in R^{H}

are the biases. We measure

m

th neuron in the fully connected layer based on the weights. We retain the neurons with large weights. Detailed descriptions can be found in Appendix A. The compressed transformer encoder layer is denoted as Sub-T-Layer, as shown in Figure 1.

Moreover, neuron compression introduces bias in gradient estimation, with the update direction diverging from the uncompressed model’s trajectory. To mitigate accumulated gradient errors during federated learning of the compressed detection model, we employ knowledge distillation to minimize the divergence between the compressed and uncompressed models. The distillation approach operates at two granularities: (1) layer-level distillation that matches intermediate representations, and (2) neuron-level distillation that preserves critical activation patterns. Layer-level distillation ensures consistency between the compressed and uncompressed models by aligning both forward pass outputs and backward pass gradients at each transformer layer. Each layer’s distillation loss can be denoted as

{l o s s}_{l a y e r} = (\sum_{i = 1}^{D} ‖O_{i} - O_{i}^{'}‖) + μ ‖W_{1} W_{2} - W_{1}^{'} W_{2}^{'}‖,

(5)

where

D

is the size of the distillation dataset,

O_{i}

and

O_{i}^{'}

represent the outputs from the uncompressed and the compressed models, respectively,

W_{1}

and

W_{2}

are the weight matrices in the uncompressed model,

W_{1}^{'}

and

W_{2}^{'}

are the weight matrices in the compressed models, and

μ

is the regularization coefficient.

The frozen FFN parameters in the compressed model introduce catastrophic forgetting risks during federated learning, as they cannot adapt to client-specific malicious traffic patterns. To address this, we employ neuron-level distillation to preserve activation knowledge across server-client updates. Neuron-level distillation identifies low-impact neurons for the malicious traffic detection task by computing their Average Percentage of Zero activations (APoZ) [51]. Set the labelled malicious traffic data on the client

x_{i} \in D^{'}

,

x_{i}

is a token sequence input into the transformer layer. the APoZ for each neuron can be defined as

A P o Z = \frac{1}{|D^{'}| |x_{i}|} \sum_{i = 1}^{|D^{'}|} \sum_{j = 1}^{|x_{i}|} I (O_{x_{i}}^{g r l} = 0),

(6)

where

O_{x_{i}}^{g r l}

is the output of the

g

th token of the input

x_{i}

at the

r

th neuron in the

l

th layer, and

I (\cdot)

is the indicator function. We calculate the APoZ for each neuron on the client using the local network traffic dataset. Subsequently, only selected neurons in the FFN are updated based on their APoZ values.

4. Experiments

4.1. Experiment Setup

4.1.1. Datasets

To evaluate the performance of the proposed MT-FBERT, we conducted comprehensive experiments utilizing four publicly available datasets encompassing diverse attack types that are representative of actual cyber threats encountered in the hydropower sector. The selected datasets include USTC-TFC2016 [52], IoT-22 [53], IoT-23 [54], and CIC-MalAnal2017 [55].

The USTC-TFC2016 dataset comprises encrypted network traffic samples categorized into 10 distinct malware families and 10 benign application types. The malware samples include prominent threats such as Cridex, Geodo, Htbot, Miuref, Neris, Nsis-ay, Shifu, Tinba, Virut, and Zeus. The benign traffic encompasses common network applications, including BitTorrent, Facetime, FTP, Gmail, MySQL, Outlook, Skype, SMB, Weibo, and WorldOfWarcraft.

The IoT-22 dataset was systematically collected across six operational states of IoT devices: Power-on initialization, Idle standby mode, Normal user interactions, Predefined usage scenarios, Active operation, and Attack conditions. During the Attack state, malicious traffic is collected by executing two distinct attacks: UDP/HTTP Flood attacks and RTSP Brute Force attacks.

The IoT-23 dataset encompasses 33 labeled attack variants, which have been classified into seven threat categories: Distributed Denial-of-Service attacks, Denial-of-Service attacks, Reconnaissance activities, Web-based exploits, Credential brute-force attempts, Network spoofing attacks, and Mirai botnet-related malicious activities.

The CICMalAnal2017 dataset includes 4354 malicious specimens and 6500 benign applications. The malware corpus is classified into four principal categories: Adware, Ransomware, Scareware, and SMS-based malware. These categories encompass 42 well-documented malware families that exhibit diverse malicious capabilities.

For the raw .pcap files in the above-mentioned datasets, data preprocessing is performed:

(1): Flow Segmentation: The open-source tool SplitCap [56] is used to divide the raw network traffic based on TCP and UDP sessions. A session is defined as a group of packets sharing the same 5-tuple (source IP, destination IP, source port, destination port, and transport protocol).
(2): Session Filtering and Cleaning: Small sessions are removed. Non-payload metadata fields, such as the Global Header and Packet Header, are excluded. Duplicate packets are also merged or removed to reduce redundancy.

After data preprocessing, the dataset exhibited a flow imbalance. For instance, in the USTC-TFC2016 dataset, the number of flows for the Shifu threat is 2, while the Cridex threat has 8189 flows. To mitigate this issue, we merged flows from selected malicious categories, and the restructured dataset is presented in Table 1.

4.1.2. Comparison Methods

To comprehensively evaluate MT-FBERT, we conducted comparisons with various baselines, including machine learning methods, deep learning methods, and pre-training methods.

(1): Classical machine learning-based method: FlowPrint [57] relies on statistical features such as packet size distribution, inter-arrival times, and directional flow patterns to generate discriminative fingerprints for different network traffic types.
(2): Deep learning-based method: TFE-GNN [58] employs graph neural networks to perform supervised traffic analysis by processing both packet-level features and raw byte sequences.
(3): Pretraining-based method: YaTC [36] captures traffic representations during pre-training and subsequently fine-tunes for specific tasks with limited labeled data.

4.1.3. Evaluation Metrics and Implementation Details

When evaluating the malicious traffic detection model, the results can be categorized as either correct or incorrect, with all possible outcomes falling into the following four scenarios: (1) True positive (TP): the attack samples are classified as attack, (2) True negative (TN): the normal samples are classified as normal, (3) False positive (FP): the normal samples are misclassified as attack, (4) False negative (FN): the attack samples are misclassified as normal. Based on the above definition, Accuracy (AC), Precision (PR), Recall (RC), and F1 can be obtained:

A C = \frac{T P + T N}{T P + T N + F P + F N},

(7)

P R = \frac{T P}{T P + F P},

(8)

R C = \frac{T P}{T P + F N},

(9)

F 1 = 2 \cdot \frac{P R \cdot R C}{P R + R C},

(10)

In pre-training, the batch size is 32 and the total steps is 100,000. We set the learning rate is

2 \times 10^{- 5}

, and the ratio of warmup is 0.048, and the dropout rate is 0.1. We fine-tune with the SGD optimizer for 100 epochs, where the learning rate is set to

8 \times 10^{- 4}

. In the simulated FL setup, we set 100 clients with 500 communication rounds and employ the Dirichlet data partition method to construct different label-skew data heterogeneity scenarios. In each communication round, we randomly select 10 clients for local fine-tuning using a linear decay of the global learning rate over rounds. We use FedAvg [38] for global model aggregation. We construct the Sub-T-Layer by compressing the intermediate 10 layers. All the experiments are implemented with PyTorch 1.9.0 and trained on a server with two NVIDIA Tesla T4 GPUs.

4.2. Overall Comparisons

We evaluated the performance of MT-FBERT in malicious traffic detection by comparing it with three baselines on four malicious traffic datasets. As shown in Table 2, MT-FBERT achieves statistically significant performance improvements over all comparative methods on both the CICMalAnal2017 and IoT-22 datasets, exhibiting superior results across all evaluation metrics. In addition, MT-FBERT reaches 99% in all metrics on both the USTC-TFC2016 and IoT-23 datasets.

From Table 2, we can find that the TFE-GNN and YaTC methods achieve perfect scores across all metrics on USTC-TFC2016 and IoT-22 datasets. The reason is that these two datasets only contain two types of network traffic data: normal and malicious flows. Furthermore, we perform the cross-dataset evaluation on USTC-TFC2016 and IoT-22 datasets. Training the models on the USTC-TFC2016 dataset and then testing them on IoT-22, and vice-versa. The results are illustrated in Table 3. From the table, we can see that MT-FBERT and YaTC demonstrate more stable detection performance compared with other baselines. Because MT-FBERT and YaTC are the pre-training models that have good robustness and generalization. To comprehensively evaluate the performance of MT-FBERT and baselines across different datasets, we calculated the average values of AC, PR, RC, and F1 over the four datasets, denoted as Avg. AC, Avg. PR, Avg. RC, and Avg. F1 respectively. Figure 5 illustrates the comparative results of MT-FBERT against other baselines in terms of Avg. AC, Avg. PR, Avg. RC and Avg. F1. From the figure, we can see that MT-FBERT achieves 0.9176 for Avg. AC, 0.9146 for Avg. PR, 0.9178 for Avg. RC, and 0.9221 for Avg. F1. Compared with baselines, MT-FBERT demonstrates superior performance in terms of the highest average values of all metrics over the four datasets. These results indicate that the MT-FBERT performs robustly across various datasets, maintaining consistent malicious traffic detection performance regardless of the differences in dataset characteristics. This highlights its strong generalizability and reliability for malicious traffic detection.

In addition, the t-test is conducted to validate if there are significant improvements between MT-FBERT and the baselines. Taking FlowPrint as an example, the Null Hypothesis (H0) assumes no significant improvements between MT-FBERT and FlowPrint. Alternate Hypothesis (H1): A significant difference between MT-FBERT and FlowPrint makes the improvement significant. First, we calculate the means and standard deviations of the MT-FBERT and FlowPrint based on AC, PR, RC, and F1 in Table 1, donated as

({\bar{x}}_{1}, {\bar{x}}_{2})

and

(s_{1}, s_{2})

. Second, we calculate the observed value as:

t_{o b s e r v e d_v a l u e} = \frac{{\bar{x}}_{1} - {\bar{x}}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}

(11)

where

n_{1}

and

n_{2}

are the number of all metrics of MT-FBERT and FlowPrint. Third, the degree of freedom (DOF) is calculated based on

D O F = n_{1} + n_{2} - 2

, and the critical value

t_{c r i t i c a l_v a l u e}

is retrieved from the T-distribution table with respect to DOF at a 95% confidence interval. The results can be found in Table 4. We can see that there are significant improvements when we compare MT-FBERT with FlowPrint and TFE-GNN. Although the improvements are not significant between MT-FBERT and YaTC, MT-FBERT can protect the data privacy based on the FL training, but YaTC is trained based on the centralized mode, which can pose privacy concerns.

4.3. Few-Shot Analysis

To validate the robustness of MT-FBERT in few-shot scenarios, we set the labeled data size to 10%, 40%, and 80% and compare MT-FBERT with baselines on four datasets. Experimental results are shown in Table 5, Table 6, Table 7 and Table 8. In addition, to clearly demonstrate the performance, we computed the average AC, PR, RC, and F1 across four datasets under limited labeled data conditions, denoted as Avg. AC, Avg. PR, Avg. RC and Avg. F1. The results are shown in Figure 6. From the figure, it can be observed that the Avg. AC, Avg. PR, Avg. RC and Avg. F1 of MT-FBERT at the different labeled sample sizes is significantly higher than other methods. The Avg. F1 of MT-FBERT decreases only slightly from 0.9175 (at 80% labeled data) to 0.9042 (at 10% labeled data), a drop of less than 1.5 percentage points. This demonstrates the MT-FBERT’s remarkable stability and consistency, indicating its low dependence on large volumes of labeled data. Notably, the MT-FBERT is able to achieve performance close to that of higher labeling ratios, even when using only 10% of labeled samples. This suggests that the MT-FBERT can effectively leverage limited supervision for learning and optimization. This characteristic is particularly valuable in real-world scenarios where labeled malicious traffic data is difficult to obtain, as it reduces annotation costs and improves the practical usability of the model.

4.4. Impact of Data Heterogeneity

We then evaluate the performance of MT-FBERT fine-tuned in Non-Independent Identically Distributed (Non-I.I.D) data distribution scenarios. For the datasets used in the data heterogeneity experiments, we unequally partition the dataset into 100 clients following the label distribution

Y_{i} ~ D i r (α p)

, where

i

is the client ID,

p

is the global label distribution,

α

is the degree of Non-I.I.D and a smaller

α

generates a high label distribution shift. We construct three different label-skewing scenarios by adjusting the value of α: 1.0, 5.0, and 10.0. Table 9 shows the results for fine-tuning MT-FBERT in different data-heterogeneous scenarios on four datasets.

We can observe that the performance of MT-FBERT is stable and consistent as the degree of Non-I.I.D increases. Especially, all the metrics are higher than 99% on the USTC-TFC2016 dataset and IoT-22 dataset when we set the degree of Non-I.I.D as 1.0, 5.0, and 10.0, respectively. This demonstrates the MT-FBERT’s strong generalization capability and robustness to data distribution variations. Such properties are crucial in real-world deployments, where data distributions are highly imbalanced. The generalization ability of the model directly affects its reliability and effectiveness in practical applications.

4.5. Efficiency Evaluation

To evaluate the efficiency of MT-FBERT, we conducted experiments comparing speed and GPU memory consumption with baselines. Given that deep learning-based and pre-training-based approaches require GPU resources, our comparative analysis specifically focuses on MT-FBERT versus TFE-GNN and YaTC.

As shown in Table 10, MT-FBERT achieves the fastest speed and the lowest memory consumption among all compared methods across various input batch sizes. At a batch size of 128, the MT-FBERT requires only approximately 206 MB of GPU memory and completes training in 2 to 5 s, depending on the dataset size. These results indicate that MT-FBERT maintains strong performance while demanding minimal computational resources. This makes it particularly advantageous in resource-constrained environments such as edge computing devices, where efficient hardware utilization is critical for deployment and operation.

4.6. Parameter and Component Study

We investigate the impact of two key hyperparameters during federated fine-tuning for malicious traffic detection: the interval

t

between neuron compression and knowledge distillation, and the neuron update proportion

p

in compression operations. The interval

t

was set to 5, 10, and 20, while

p

was configured as 0.3, 0.5, and 0.8. Table 11 presents the performance metrics on CICMalAnal2017 and IoT-23 datasets, demonstrating that optimal results occur at

t = 10

and

p = 0.5

. For interval

t

, longer periods exacerbate gradient error accumulation, whereas shorter intervals impede federated fine-tuning adaptation. Concerning proportion

p

, it is important to make a balance between updating an adequate proportion of neurons and avoiding excessive disruption to local fine-tuning.

In addition, we verify the contribution of the NBP task in the pre-training on the CICMalAnal2017 dataset, and the results in Table 12 show that the detection performance of MT-FBERT degrades when the NBP is removed. And we conduct the comparisons between FedAvg and FedOpt [59] in the fine-tuning. The results on the CICMalAnal2017 dataset in Table 13 demonstrate FedOpt achieves the better detection performance compared with FedAvg. Because the FedOpt designs an optimized framework to deal with the data heterogeneity.

5. Discussion

In this section, we discuss the limitations of this work. (i) Single-Flow Detection Capability. MT-FBERT operates as a flow-by-flow detection scheme, where the input is exclusively composed of information from a single network flow. In multiflow scenarios (e.g., accessing a web page often generates multiple concurrent flows), the detection performance of MT-FBERT may exhibit degradation. To mitigate this limitation, the input architecture could be extended to incorporate multidimensional data, enabling the model to leverage multidimensional attention mechanisms for multi-flow analysis. (ii) Pre-training Data Security. Although MT-FBERT demonstrates strong robustness and generalization across various malicious traffic detection datasets, its effectiveness hinges on the availability of clean pre-training data. When an attacker deliberately inserts low-frequency subwords as toxic embeddings, a poisoned pre-trained model with hidden backdoors can be induced, compelling the model to mispredict target classes and subsequently compromise the normally fine-tuned model in specific threat detection tasks [14,15]. Notably, the problem of constructing toxic tokens within network traffic remains unexplored in existing literature.

6. Conclusions

In this paper, we introduce a novel malicious traffic detection method based on an efficient FL framework of BERT, called MT-FBERT. MT-FBERT pre-trains the network traffic model to take advantage of the unlabeled traffic data to learn generic traffic patterns without relying on expert experience. Furthermore, we employ an FL framework optimized for efficient malicious traffic detection, which operates across distributed clients possessing labeled malicious traffic datasets. This framework employs an adaptive neuron selection mechanism that dynamically identifies and updates only the most significant neurons in the global model during client-side fine-tuning. MT-FBERT successfully addresses the dual challenges of privacy preservation and detection efficacy in multi-organization cybersecurity environments.

Through extensive experimentation across multiple benchmark malicious traffic datasets, MT-FBERT demonstrates its powerful ability to generalize from limited data and various distribution shifts. The efficient FL framework of MT-FBERT maintains strong performance while demanding minimal computational resources. This makes it particularly advantageous in resource-constrained environments. In the future, we will deploy MT-FBERT in the real-world network environment to validate its practical efficacy.

Author Contributions

Conceptualization, J.T.; methodology, J.T.; software, J.T.; validation, J.T., Z.H. and C.L.; writing—original draft preparation, J.T. and Z.H.; writing—review and editing, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Program of Yunnan Provincial Science and Technology Department, China, grant number 202501AS070131.

Data Availability Statement

The datasets and codes used during the current study are available from the corresponding author on reasonable request.

Acknowledgments

We would like to express our sincere gratitude to the anonymous reviewers for their invaluable feedback and insightful comments. We also appreciate their time and effort in reviewing our work, which significantly contributed to the improvement of our paper.

Conflicts of Interest

Author Jian Tang and Zhao Huang were employed by the company China International Water & Electric Corp. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

We measure

m

th neuron in the fully connected layer based on the magnitude-based pruning method [53], which is defined as

R (m) = \sqrt{\sum_{j = 1}^{H} (u_{j m}^{2} + v_{m j}^{2})} 1 \leq m \leq d_{f},

(A1)

where

u_{j m}

is the

(m, j)

element in the weight matrices

W_{2}

, and

v_{m j}

is the

(m, j)

element in the weight matrices

W_{1}

. The larger

R (m)

is, the larger the weight the

m

th neuron is. Hence, we retain the neurons with large

R (m)

and remove the neurons with small

R (m)

.

References

Varghese, S.A.; Ghadim, A.D.; Balador, A.; Alimadadi, Z.; Papadimitratos, P. Digital twin-based intrusion detection for industrial control systems. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops), Pisa, Italy, 21–25 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 611–617. [Google Scholar]
Falliere, N.; Murchu, L.O.; Chien, E. W32. Stuxnet Dossier; White Paper; Symantec Corp., Security Response: Cupertino, CA, USA, 2011. [Google Scholar]
Alladi, T.; Chamola, V.; Zeadally, S. Industrial control systems: Cyberattack trends and countermeasures. Comput. Commun. 2020, 155, 1–8. [Google Scholar] [CrossRef]
Pinto, A.; Dragoni, Y.; Carcano, A. Triton: The first ICS cyber attack on safety instrument systems. Proc. Black Hat USA 2018, 2018, 1–26. [Google Scholar]
Akbarian, F.; Fitzgerald, E.; Kihl, M. Intrusion detection in digital twins for industrial control systems. In Proceedings of the International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 17–19 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
Abrams, M.; Weiss, J. Malicious Control System Cyber Security Attack Case Study—MaroochyWater Services, Australia; The MITRE Corporation: McLean, VA, USA, 2008. [Google Scholar]
Aoudi, W.; Iturbe, M.; Almgren, M. Truth will out: Departure-based process-level detection of stealthy attacks on control systems. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 817–831. [Google Scholar]
Chen, T.; Abu-Nimeh, S. Lessons from Stuxnet. Computer 2011, 44, 91–93. [Google Scholar] [CrossRef]
Nelson, N. The Impact of Dragonfly Malware on Industrial Control Systems; SANS Institute: North Bethesda, MD, USA, 2016. [Google Scholar]
Spenneberg, R.; Brüggemann, M.; Schwartke, H. PLC-Blaster: A Worm Living Solely in the PLC. In Proceedings of the Black Hat Asia, Singapore, 29 March–1 April 2016. [Google Scholar]
Govil, N.; Agrawal, A.; Tippenhauer, N. On Ladder Logic Bombs in Industrial Control Systems. In Computer Security, Proceedings of the ESORICS 2017 International Workshops, CyberICPS 2017 and SECPRE 2017, Oslo, Norway, 14–15 September 2017; Springer: Cham, Switzerland, 2017. [Google Scholar]
Abbasi, A.; Hashemi, M. Ghost in the PLC Designing an Undetectable Programmable Logic Controller Rootkit via Pin Control Attack. In Proceedings of the Black Hat Europe, London, UK, 3–4 November 2016. [Google Scholar]
Hadžiosmanović, D.; Sommer, R.; Zambon, E.; Hartel, P. Through the Eye of the PLC: Semantic Security Monitoring for Industrial Processes. In Proceedings of the Annual Computer Security Applications Conference, New Orleans, LA, USA, 8–12 December 2014; ACM: New York, NY, USA, 2014. [Google Scholar]
Venezuelanalysis. Venezuela Hit by Electrical Blackout, Authorities Denounce Attack. Available online: https://venezuelanalysis.com/news/breaking-venezuela-hit-by-electrical-blackout-authorities-denounce-attack/ (accessed on 21 June 2025).
Nadeau, M. Attackers Hijack Solar Panel Monitoring Devices in Japan. Available online: https://www.iotm2mcouncil.org/iot-library/news/smart-energy-news/attackers-hijack-solar-panel-monitoring-devices-in-japan (accessed on 21 June 2025).
Greig, J. German Wind Turbine Maker Shut Down After Cyberattack. Available online: https://therecord.media/german-wind-turbine-maker-shut-down-after-cyberattack (accessed on 21 June 2025).
Xu, Y.; Wang, X. Security Governance of Cross-Border Data Flow Under the Holistic View of National Security. Doc. Inf. Knowl. 2023, 40, 20–30. [Google Scholar]
Xu, W.; Wang, S.; Zuo, X. Global data governance at a turning point? Rethinking China-US cross-border data flow regulatory models. Comput. Law Secur. Rev. 2024, 55, 106061. [Google Scholar] [CrossRef]
Zhai, D. RCEP Rules on Cross-Border Data Flows: Asian characteristics and implications for developing countries. Asia Pac. Law Rev. 2025, 33, 24–45. [Google Scholar] [CrossRef]
McPherson, J.; Ma, K.L.; Krystosk, P.; Bartoletti, T.; Christensen, M. PortVis: A tool for port-based detection of security events. In Proceedings of the ACM Workshop on Visualization and Data Mining for Computer Security, Washington, DC, USA, 29 October 2004; pp. 73–81. [Google Scholar]
Korczyński, M.; Duda, A. Markov chain fingerprinting to classify encrypted traffic. In Proceedings of the IEEE INFOCOM 2014—IEEE Conference on Computer Communications, Toronto, ON, Canada, 27 April–2 May 2014; pp. 781–789. [Google Scholar]
Ning, J.; Poh, G.S.; Loh, J.C.N.; Chia, J.; Chang, E.C. PrivDPI: Privacy-preserving encrypted traffic inspection with reusable obfuscated rules. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 1657–1670. [Google Scholar]
Panchenko, A.; Lanze, F.; Pennekamp, J.; Engel, T.; Zinnen, A.; Henze, M.; Wehrle, K. Website fingerprinting at internet scale. In Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 21–24 February 2016; pp. 1–15. [Google Scholar]
Shekhawat, A.S.; Di Troia, F.; Stamp, M. Feature analysis of encrypted malicious traffic. Expert Syst. Appl. 2019, 125, 130–141. [Google Scholar] [CrossRef]
Cai, Z.; Jiang, B.; Lu, Z.; Liu, J.; Ma, P. isAnon: Flow-based anonymity network traffic identification using extreme gradient boosting. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
Wang, W.; Zhu, M.; Wang, J.; Zeng, X.; Yang, Z. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 43–48. [Google Scholar]
Zhang, J.; Li, F.; Ye, F.; Wu, H. Autonomous unknown-application filtering and labeling for DL-based traffic classifier update. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 397–405. [Google Scholar]
Lotfollahi, M.; Siavoshani, M.J.; Zade, R.S.H.; Saberian, M. Deep packet: A novel approach for encrypted traffic classification using deep learning. Soft Comput. 2020, 24, 1999–2012. [Google Scholar] [CrossRef]
Lin, K.; Xu, X.; Gao, H. TSCRNN: A novel classification scheme of encrypted traffic based on flow spatiotemporal features for efficient management of IIoT. Comput. Netw. 2021, 190, 107974. [Google Scholar] [CrossRef]
Peters, M.E.; Ammar, W.; Bhagavatula, C.; Power, R. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1756–1765. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018, unpublished. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 26 June 2025).
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference of Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Wang, T.; Xie, X.; Wang, W.; Wang, C.; Zhao, Y.; Cui, Y. Netmamba: Efficient network traffic classification via pre-training unidirectional Mamba. In Proceedings of the 2024 IEEE 32nd International Conference on Network Protocols (ICNP), Charleroi, Belgium, 28–31 October 2024; pp. 1–11. [Google Scholar]
Zhao, R.; Zhan, M.; Deng, X.; Wang, Y.; Wang, Y.; Gui, G.; Xue, Z. Yet another traffic classifier: A masked autoencoder based traffic transformer with multi-level flow representation. Proc. AAAI Conf. Artif. Intell. 2023, 37, 5420–5427. [Google Scholar] [CrossRef]
Hang, Z.; Lu, Y.; Wang, Y.; Xie, Y. Flow-MAE: Leveraging masked autoencoder for accurate, efficient and robust malicious traffic classification. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, 16–18 October 2023; pp. 297–314. [Google Scholar]
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
Zhou, X.; Huang, W.; Liang, W.; Yan, Z.; Ma, J.; Pan, Y.; Wang, K. Federated distillation and blockchain empowered secure knowledge sharing for internet of medical things. Inf. Sci. 2024, 662, 120217. [Google Scholar] [CrossRef]
Zhang, J.; Liang, S.; Ye, F.; Hu, R.Q.; Qian, Y. Towards detection of zero-day botnet attack in IoT networks using federated learning. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 7–12. [Google Scholar]
Mun, H.; Lee, Y. Internet traffic classification with federated learning. Electronics 2021, 10, 27. [Google Scholar] [CrossRef]
He, Z.; Yin, J.; Wang, Y.; Gui, G.; Adebisi, B.; Ohtsuki, T.; Gacanin, H.; Sari, H. Edge device identification based on federated learning and network traffic feature engineering. IEEE Trans. Cogn. Commun. Netw. 2021, 8, 1898–1909. [Google Scholar] [CrossRef]
Majeed, U.; Khan, L.U.; Hong, C.S. Cross-silo horizontal federated learning for flow-based time-related-features oriented traffic classification. In Proceedings of the 2020 21st Asia-Pacific Network Operations and Management Symposium (APNOMS), Daegu, Republic of Korea, 22–25 September 2020. [Google Scholar]
Guo, Y.; Wang, D. FEAT: A federated approach for privacy-preserving network traffic classification in heterogeneous environments. IEEE Internet Things J. 2022, 10, 1274–1285. [Google Scholar] [CrossRef]
Tian, Y.; Wan, Y.; Lyu, L. FedBERT: When federated learning meets pre-training. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 66. [Google Scholar] [CrossRef]
Lit, Z.; Sit, S.; Wang, J.; Xiao, J. Federated split BERT for heterogeneous text classification. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
Zhang, Z.; Yang, Y.; Dai, Y.; Wang, Q.; Yu, Y.; Qu, L.; Xu, Z. FedPETuning: When Federated Learning Meets the Parameter-Efficient Tuning Methods of Pre-trained Language Models. In Proceedings of the Findings of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 9963–9977. [Google Scholar]
Wang, X.; Chen, K.; Shou, L.; Luo, X.; Chen, G. Efficient Framework for BERT Model Training Based on Federated Learning. Ruan Jian Xue Bao/J. Softw. 2025, 36, 4111–4134. (In Chinese) [Google Scholar]
Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
Bogoychev, N. Not all parameters are born equal: Attention is mostly what you need. arXiv 2020, arXiv:2010.11859. [Google Scholar]
Hu, H.; Peng, R.; Tai, Y.-W.; Tang, C.-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv 2016, arXiv:1607.03250. [Google Scholar]
Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural network for representation learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017. [Google Scholar]
Dadkhah, S.; Mahdikhani, H.; Danso, P.K.; Zohourian, A.; Truong, K.A.; Ghorbani, A.A. Towards the development of a realistic multidimensional IoT profiling dataset. In Proceedings of the 2022 19th Annual International Conference on Privacy, Security & Trust (PST), Fredericton, NB, Canada, 22–24 August 2022. [Google Scholar]
Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment. Sensors 2023, 23, 5941. [Google Scholar] [CrossRef]
Lashkari, A.H.; Kadir, A.F.A.; Taheri, L.; Ghorbani, A.A. Toward developing a systematic approach to generate benchmark Android malware datasets and classification. In Proceedings of the 2018 International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018. [Google Scholar]
SplitCap. Available online: https://www.netresec.com/?page=SplitCap (accessed on 28 May 2025).
Van Ede, T.; Bortolameotti, R.; Continella, A.; Ren, J.; Dubois, D.J.; Lindorfer, M.; Choffnes, D.; Van Steen, M.; Peter, A. Flowprint: Semi-supervised mobile-app fingerprinting on encrypted network traffic. In Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 23–26 February 2020; Volume 27. [Google Scholar]
Zhang, H.; Yu, L.; Xiao, X.; Li, Q.; Mercaldo, F.; Luo, X.; Liu, Q. TFE-GNN: A temporal fusion encoder using graph neural networks for fine-grained encrypted traffic classification. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2066–2075. [Google Scholar]
Ahmed, S.T.; Vinoth Kumar, V.; Mahesh, T.R.; Prasad, L.V.N.; Velmurugan, A.K.; Muthukumaran, V. FedOPT: Federated learning-based heterogeneous resource recommendation and optimization for edge computing. Soft Comput. 2024, 1–12. [Google Scholar] [CrossRef]

Figure 1. The framework of MT-FBERT.

Figure 2. An example of the traffic data pre-processing.

Figure 3. An example of token representation based on three embeddings.

Figure 4. An illustration of the Masked Burst Modeling (MBM).

Figure 5. Comparative results in terms of Avg. AC, Avg. PR, Avg. RC and Avg. F1 across four datasets.

Figure 6. Comparative results in terms of Avg. AC, Avg. PR, Avg. RC, and Avg. F1 across four datasets using different labeled sample sizes.

Table 1. The datasets’ information after data preprocessing.

Dataset	Network Traffic Categories	The Number of Flows
USTC-TFC2016	Benign	15,000
USTC-TFC2016	Malicious	14,207
IoT-22	Benign	23,000
IoT-22	Malicious	22,900
IoT-23	Benign	23,264
	DoS and DDoS	20,000
	MITM and Spoofing	11,516
	Recon and Scanning	17,768
CICMalAnal2017	Benign	243,577
	Adware	70,173
	Ransomware	68,032
	Scareware	76,761
	SMSMalware	45,258

Table 2. Overall comparisons of four methods on four malicious traffic datasets.

Dataset	Method	AC	PR	RC	F1
USTC-TFC2016	MT-FBERT	0.9996	0.9901	0.9993	0.9993
	FlowPrint	0.8904	0.6471	0.8462	0.7333
	TFE-GNN	1.0000 *	1.0000	1.0000	1.0000
	YaTC	1.0000	1.0000	1.0000	1.0000
CICMalAnal2017	MT-FBERT	0.8688	0.8700	0.8655	0.8683
	FlowPrint	0.4936	0.4382	0.4389	0.4376
	TFE-GNN	0.2696	0.2959	0.2696	0.2061
	YaTC	0.7283	0.7314	0.7277	0.7055
IoT-22	MT-FBERT	0.9998	0.9998	0.9998	0.9940
	FlowPrint	0.9574	0.9574	1.0000	0.9783
	TFE-GNN	1.0000	1.0000	1.0000	1.0000
	YaTC	1.0000	1.0000	1.0000	1.0000
IoT-23	MT-FBERT	0.8023	0.7986	0.8065	0.8266
	FlowPrint	0.5647	0.5179	0.5415	0.5172
	TFE-GNN	0.6962	0.7300	0.6962	0.6907
	YaTC	0.7449	0.7438	0.7449	0.7433

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same dataset.

Table 3. Cross-dataset evaluation.

Dataset	Method	AC	PR	RC	F1
Training on USTC-TFC2016, testing on IoT 22	MT-FBERT	0.9808	0.9753	0.9751	0.9806
	FlowPrint	0.9928 *	0.9951	0.9977	0.9964
	TFE-GNN	0.6701	0.7949	0.6701	0.6156
	YaTC	0.9775	0.9700	0.9787	0.9795
Training on IoT 22, testing on USTC-TFC2016	MT-FBERT	0.9791	0.9796	0.9725	0.9766
	FlowPrint	0.7317	0.7317	1.0000	0.8451
	TFE-GNN	0.7593	0.8350	0.7593	0.7410
	YaTC	0.9720	0.9667	0.9659	0.9728

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same dataset.

Table 4. Significance between MT-FBERT and baselines.

Comparison Methods	$t_{o b s e r v e d_v a l u e}$	$t_{c r i t i c a l_v a l u e}$	Significant Improvements
MT-FBERT vs. FlowPrint	3.9738	0.0004	True
MT-FBERT vs. TFE-GNN	2.1821	0.0370	True
MT-FBERT vs. YaTC	1.2622	0.2166	False

Table 5. Comparison results on USTC-TFC2016 dataset using different labeled sample sizes.

Labeled Data Size	10%				40%				80%
Method	AC	PR	RC	F1	AC	PR	RC	F1	AC	PR	RC	F1
MT-FBERT	0.9990	0.9932	0.9927	0.9975	0.9995	0.9903	0.9903	0.9947	0.9995	0.9991	0.9991	0.9991
FlowPrint	0.5333	0.2708	1.0000 *	0.4262	0.8429	0.5556	0.7692	0.6452	0.8197	0.5500	0.8462	0.6667
TFE-GNN	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
YaTC	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same labeled sample size.

Table 6. Comparison results on CICMalAnal2017 dataset using different labeled sample sizes.

Labeled Data Size	10%				40%				80%
Method	AC	PR	RC	F1	AC	PR	RC	F1	AC	PR	RC	F1
MT-FBERT	0.8629 *	0.8518	0.8389	0.8322	0.8650	0.8580	0.8516	0.8673	0.8685	0.8691	0.8693	0.8690
FlowPrint	0.4633	0.3886	0.3665	0.3754	0.4991	0.4344	0.4315	0.4327	0.4870	0.4210	0.4133	0.4164
TFE-GNN	0.4024	0.1619	0.4024	0.2309	0.3996	0.2869	0.3996	0.3088	0.2640	0.3550	0.2640	0.2055
YaTC	0.5703	0.3253	0.5703	0.4143	0.6426	0.6238	0.6426	0.5933	0.7300	0.7315	0.7300	0.7038

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same labeled sample size.

Table 7. Comparison results on IoT-22 dataset using different labeled sample sizes.

Labeled Data Size	10%				40%				80%
Method	AC	PR	RC	F1	AC	PR	RC	F1	AC	PR	RC	F1
MT-FBERT	0.9987	0.9992	0.9989	0.9987	0.9992	0.9992	0.9992	0.9992	0.9998	0.9954	0.9961	0.9985
FlowPrint	0.1455	1.0000 *	0.1132	0.2034	0.9123	0.9630	0.9455	0.9541	0.9667	0.9667	1.0000	0.9831
TFE-GNN	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
YaTC	0.9996	0.9815	0.9808	0.9808	0.9996	0.9985	0.9985	0.9985	0.9996	0.9960	0.9960	0.9960

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same labeled sample size.

Table 8. Comparison results on IoT-23 dataset using different labeled sample sizes.

Labeled Data Size	10%				40%				80%
Method	AC	PR	RC	F1	AC	PR	RC	F1	AC	PR	RC	F1
MT-FBERT	0.7912 *	0.7910	0.7861	0.7883	0.7915	0.7906	0.7633	0.7795	0.8020	0.8137	0.8164	0.8022
FlowPrint	0.5062	0.4425	0.4847	0.4275	0.5465	0.5225	0.5063	0.5087	0.6348	0.4210	0.4133	0.5727
TFE-GNN	0.5443	0.5715	0.5443	0.5389	0.7215	0.7533	0.7215	0.7151	0.7025	0.7657	0.7025	0.7014
YaTC	0.5871	0.5855	0.5715	0.5701	0.6588	0.6816	0.6572	0.6536	0.7224	0.7251	0.7226	0.7191

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same labeled sample size.

Table 9. Non-Independent Identically Distributed (Non-I.I.D) experimental results of MT-FBERT.

Dataset	Degree of Non-I.I.D	AC	PR	RC	F1
USTC-TFC2016	1.0	0.9996	0.9991	0.9991	0.9992
	5.0	0.9998	0.9996	0.9995	0.9995
	10.0	0.9998	0.9996	0.9996	0.9996
CICMalAnal2017	1.0	0.9501	0.9213	0.9325	0.9147
	5.0	0.9588	0.9570	0.9741	0.9348
	10.0	0.9646	0.9342	0.9284	0.9527
IoT-22	1.0	0.9992	0.9993	0.9990	0.9990
	5.0	0.9998	0.9995	0.9996	0.9996
	10.0	0.9999	0.9998	0.9998	0.9999
IoT-23	1.0	0.9652	0.9621	0.9666	0.9610
	5.0	0.9711	0.9637	0.9622	0.9602
	10.0	0.9715	0.9728	0.9887	0.9821

Table 10. The speed and GPU memory consumption comparison.

Dataset	Method	MT-FBERT		TFE-GNN		YaTC
Dataset	Batchsize	GPU Memory Usage (MB)	Time Consumption (Seconds)	GPU Memory Usage (MB)	Time Consumption (Seconds)	GPU Memory Usage (MB)	Time Consumption (Seconds)
USTC-TFC2016	32	62.73 *	2.23	2441.42	419.62	79.76	2.30
	64	110.85	2.19	3748.65	431.89	126.23	2.29
	128	206.27	2.18	6122.31	353.29	219.09	2.25
CICMalAnal2017	32	62.72	24.75	2553.35	5254.05	79.77	26.29
	64	119.86	24.67	3859.78	6150.25	126.24	26.00
	128	206.27	24.65	6107.81	3966.97	219.10	25.37
IoT-22	32	62.73	3.07	1984.45	541.61	79.76	3.36
	64	110.85	3.01	2816.71	540.59	126.23	3.27
	128	206.27	2.98	4147.03	522.92	219.09	3.24
IoT-23	32	62.72	4.30	2308.53	1147.46	79.77	4.95
	64	110.83	4.28	3317.26	1507.28	126.24	4.90
	128	206.28	4.22	5383.61	1949.53	219.10	4.85

* A number in bold means that the corresponding method performs better than its counterpart does, for the same metric and the same dataset.

Table 11. The effects of two hyperparameters on CICMalAnal2017 and IoT-23 datasets.

Dataset	The Interval t Between Neuron Compression and Knowledge Distillation	The Proportion p of Neurons That Need to Be Updated	AC	PR	RC	F1
CICMalAnal2017	5	0.3	0.8640	0.8593	0.8482	0.8584
		0.5	0.8643	0.8662	0.8666	0.8728
		0.8	0.8641	0.8709	0.8623	0.8710
	10	0.3	0.8685	0.8700	0.8651	0.8681
		0.5	0.8720 *	0.8758	0.8942	0.8843
		0.8	0.8688	0.8690	0.8655	0.8629
	20	0.3	0.8644	0.8583	0.8369	0.8375
		0.5	0.8645	0.8685	0.8643	0.8547
		0.8	0.8638	0.8688	0.8721	0.8628
IoT-23	5	0.3	0.8018	0.7992	0.8021	0.8284
		0.5	0.8275	0.8267	0.8462	0.8737
		0.8	0.8253	0.7895	0.7983	0.7992
	10	0.3	0.8020	0.7985	0.8065	0.8266
		0.5	0.8399	0.8372	0.8565	0.8564
		0.8	0.8013	0.8263	0.8436	0.8261
	20	0.3	0.8021	0.7981	0.7997	0.8043
		0.5	0.8196	0.8236	0.8265	0.8265
		0.8	0.8284	0.8252	0.8163	0.8348

* A number in bold means that the corresponding hyperparameter performs better than its counterpart does, for the same metric and the same dataset.

Table 12. NBP evaluation on CICMalAnal2017 dataset.

Method	AC	PR	RC	F1
MT-FBERT_MBM	0.7149	0.6693	0.6892	0.7570
MT-FBERT_MBM+NBP	0.8688	0.8700	0.8655	0.8683

Table 13. Global model aggregation evaluation on the CICMalAnal2017 dataset.

Method	AC	PR	RC	F1
MT-FBERT_FedAvg	0.8688	0.8700	0.8655	0.8683
MT-FBERT_FedOpt	0.8750	0.8798	0.8796	0.8841

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tang, J.; Huang, Z.; Li, C. MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT. Future Internet 2025, 17, 323. https://doi.org/10.3390/fi17080323

AMA Style

Tang J, Huang Z, Li C. MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT. Future Internet. 2025; 17(8):323. https://doi.org/10.3390/fi17080323

Chicago/Turabian Style

Tang, Jian, Zhao Huang, and Chunqiang Li. 2025. "MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT" Future Internet 17, no. 8: 323. https://doi.org/10.3390/fi17080323

APA Style

Tang, J., Huang, Z., & Li, C. (2025). MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT. Future Internet, 17(8), 323. https://doi.org/10.3390/fi17080323

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MT-FBERT: Malicious Traffic Detection Based on Efficient Federated Learning of BERT

Abstract

1. Introduction

2. Related Works

2.1. Malicious Traffic Detection

2.2. Federated Learning

3. MT-FBERT

3.1. Framework

3.2. The Pre-Training Stage

3.2.1. Traffic Data Pre-Processing

3.2.2. The Pre-Training Tasks

3.3. The Malicious Traffic Detection Stage

4. Experiments

4.1. Experiment Setup

4.1.1. Datasets

4.1.2. Comparison Methods

4.1.3. Evaluation Metrics and Implementation Details

4.2. Overall Comparisons

4.3. Few-Shot Analysis

4.4. Impact of Data Heterogeneity

4.5. Efficiency Evaluation

4.6. Parameter and Component Study

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI