1. Introduction
Spam emails not only disseminate misinformation such as scams and rumors but also threaten the information security of organizations and individuals. For instance, cybercriminals use phishing emails to entice recipients into clicking malicious links or downloading attachments, thereby gaining access to personal data, bank accounts, and other confidential information [1]. Spam is a primary conduit for cyberattacks, and the security threats it carries are rising markedly: malicious attachments and phishing links account for approximately 43% of its dissemination [2]. In recent decades, researchers have explored numerous countermeasures, including mitigation strategies against spam email spoofing attacks [3], systems for evaluating user behavior in response to phishing emails [4], cybersecurity training to strengthen resilience to phishing attacks [5,6], email security awareness programs [7], and anti-phishing solutions tailored to specific corporate contexts [8]. These methods address spam from multiple viewpoints; however, they are often difficult to deploy and typically require manual intervention to be effective.
Benefiting from advances in natural language processing (NLP), particularly machine learning and deep learning techniques [9,10], new capabilities have emerged for spam detection systems [11]. Early machine learning detectors for spam on email and IoT platforms relied predominantly on supervised learning [12,13,14]. Supervised learning requires manual dataset annotation, which demands substantial time and effort and depends heavily on the quality and quantity of the training data [15]; this dependency poses significant challenges for experimental implementation. Deep learning models with dynamically updated feature spaces [16]—such as Long Short-Term Memory (LSTM) [17], Convolutional Neural Networks (CNN) [18], Gated Recurrent Units (GRU) [19], and their bidirectional variants—offer superior feature extraction and outperform traditional machine learning methods. However, single neural network architectures have inherent limitations: constrained capture of temporal dependencies, insufficient depth in automated hierarchical representation learning, and inadequate modeling of semantically complex text. They also remain susceptible to overfitting. Together, these limitations leave single-network models short of the desired effectiveness.
Hybrid deep learning methods [20] are adept at leveraging contextual information from multiple modules and layers. However, these methods [21,22,23,24] require constructing a dictionary to process text, and the resulting word vectors are static, which complicates handling polysemy. Pre-trained models [25] and large language models [26,27] have also been widely applied to spam detection, but their training costs are a substantial obstacle. By incorporating multi-feature fusion (MF), some methods [28,29] improve a model's feature extraction capability. Although feature fusion has proven effective in a variety of fields [30,31,32,33], it has not provided privacy protection when detecting spam.
Email was originally intended to let users share information conveniently and securely [34], and spam detection should not come at the cost of leaking private data. Federated learning (FL) [35,36] offers distinct advantages through its decentralized architecture and distributed training, enabling collaborative learning across many devices and servers while safeguarding data privacy and security [37,38]. In federated learning, participants train models autonomously on local datasets and transmit only the resulting model parameters to a central server. The server consolidates the parameters from the participants with an aggregation algorithm, produces globally optimized parameters, and relays them back to each participant; this procedure repeats over successive rounds. Because the raw data never leaves the client, the risks of cross-domain data transmission and leakage are mitigated, fundamentally addressing the privacy challenges inherent in traditional centralized learning. This mechanism also dismantles data silos, enabling collaborative modeling over multi-source data and improving model generalization and robustness while preserving data sovereignty.
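The parameter-exchange loop described above can be sketched in a few lines. The following is a minimal, illustrative numpy sketch, not the model proposed in this paper: each client takes one gradient step on its private data, and the server combines the returned parameters with a FedAvg-style weighted average. The `local_update` function and the toy gradients and dataset sizes are hypothetical.

```python
import numpy as np

def local_update(weights, grad, lr=0.1):
    """One hypothetical local training step on a client's private data."""
    return weights - lr * grad

def federated_average(client_weights, client_sizes):
    """Server-side aggregation: weight each client's parameters by its
    local dataset size (the FedAvg rule)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# One round: the server broadcasts the global weights; each client trains
# locally and returns only its updated parameters, never its raw emails.
global_w = np.zeros(3)
clients = [
    (np.array([0.2, -0.1, 0.4]), 100),  # (local gradient, dataset size)
    (np.array([0.1,  0.3, 0.0]), 300),
]
updates = [local_update(global_w, g) for g, _ in clients]
global_w = federated_average(updates, [n for _, n in clients])
```

Note that only `updates` crosses the network; the arrays standing in for each client's data stay on the client.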
Federated learning has now been adopted across many artificial intelligence scenarios [39], including blockchain [40], smart healthcare [41,42], medical imaging [43], and the Internet of Things (IoT) [44,45], in each case leveraging its ability to prevent private data leakage. Although several methods have achieved notable performance in spam detection [46,47], they tend to be biased toward majority classes under imbalanced data distributions, resulting in unstable classification and limited detection capability. Single neural network models suffer from training instability that can lead to mode collapse, while multi-module architectures face expressiveness limits due to excessive complexity. These issues are critical pain points in contemporary spam detection research.
To address these challenges, this paper integrates a federated learning framework with multi-feature fusion and proposes a novel spam detection model within this architecture. The model employs three key technical enhancements: first, the FedProx aggregation algorithm [48,49] compensates for imbalanced data distributions; second, a horse-racing selection strategy improves stability during server-side parameter aggregation; third, hierarchical multi-feature fusion mitigates the limitations of both single and multi-module architectures. As a result, the model significantly reduces computational overhead while improving training stability, achieving the dual objectives of preventing sensitive privacy leakage and improving detection efficiency.
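As a rough illustration of how FedProx differs from plain local training, the sketch below adds the gradient of the proximal term (mu/2)·||w − w_global||² to each client step, pulling heterogeneous clients back toward the global model. The toy quadratic loss, learning rate, and coefficient values are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_fn, mu=0.5, lr=0.1):
    """One FedProx client step: local-loss gradient plus the gradient of the
    proximal term (mu/2)*||w - w_global||^2, which limits client drift on
    non-IID data."""
    grad = grad_fn(w) + mu * (w - w_global)
    return w - lr * grad

# Toy quadratic local loss ||w - target||^2 on one client (hypothetical data).
target = np.array([1.0, 1.0])
grad_fn = lambda w: 2.0 * (w - target)

w_global = np.zeros(2)
w = fedprox_local_step(w_global.copy(), w_global, grad_fn)
```

With mu = 0 this reduces to the plain local step used by FedAvg; larger mu keeps clients with skewed data closer to the shared model.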
The remainder of this paper is organized as follows:
Section 2 reviews recent federated and non-federated detection methods.
Section 3 provides a detailed description of the technologies and methods, including data acquisition and preparation, federated learning optimization strategies, word vector transformation, and pathway prediction.
Section 4 presents the architecture of the federated learning system and discusses the specific structure of the model.
Section 5 presents a detailed analysis of FedAvg and FedProx on six datasets.
Section 6 concludes the paper and proposes research directions for future federated learning methods.
2. Related Research
A substantial body of high-quality research has addressed the detection and classification of emails. Brindha et al. [50] proposed ICSOA-DLPEC, a phishing email detection and classification model built on the intelligent cuckoo search (CS) optimization algorithm. ICSOA-DLPEC first performs a three-step preprocessing procedure of email cleansing, tokenization, and stop-word removal; the CS algorithm then extracts pertinent feature vectors, and the N-gram method is combined with a GRU model to identify and categorize phishing emails. Chinta et al. [51] used a BERT-LSTM hybrid model that, through extensive preprocessing and feature extraction, effectively identifies intricate patterns in phishing emails. While these methods achieve remarkable accuracy in spam detection, they do not address user privacy protection.
The primary challenge on social networks is safeguarding privacy, and the data security of email is a crucial, persistent concern within them. Ul Haq et al. [47] proposed a federated phishing email filtering (FPF) technology combining federated learning, natural language processing (NLP), and deep learning. It provides four training modalities: Training from Server Model (TSM), Training from New Data (TND), Re-Training with Incremental Learning (TIL), and Model Averaging (MA). It detects spam without exchanging email content, maintaining accuracy consistently between 93% and 96%.
Thapa et al. [52] integrated the BERT model with federated learning and systematically assessed the performance of BERT [53], THEMIS, and THEMISb under three data distribution scenarios: asymmetric distributions, highly heterogeneous client datasets, and balanced distributions. The study found that federated learning is remarkably robust in complex scenarios involving imbalanced or highly dispersed data. However, the efficacy of the BERT model for phishing email detection within a centralized learning (CL) framework remains unverified, and the method's local training and global aggregation stages rely heavily on BERT's feature extraction abilities. Moreover, several experiments showed significant variability in the global model's test results, underscoring the need for improved stability.
Kaushal et al. [54] introduced a federated learning-based fair clustering method that tackles privacy preservation and the intrinsic data distribution disparity of decentralized systems, showing substantial performance gains over conventional federated learning models. Compared with centralized models and conventional federated learning, this approach offers an efficient solution for decentralized spam detection, exploiting federated learning's privacy protection while improving data fairness, and it lays the groundwork for broad deployment of federated learning in privacy-sensitive areas. Nonetheless, despite this technological advance, its detection accuracy leaves considerable room for improvement.
Venčkauskas et al. [55] proposed a method to strengthen the resilience of the federated learning (FL) global model against Byzantine attacks by accepting updates only from reliable participants. The approach combines FL with a domain-specific email ontology, a semantic parser, and a benchmark dataset collected from a heterogeneous email corpus, ensuring a high level of privacy protection. It continuously predicts malicious behavior in client models and is notably effective against malicious attacks; using a heterogeneous email corpus as the benchmark also addresses challenges arising from data heterogeneity. However, because the approach is built on a machine learning model, its performance across metrics is not fully satisfactory, with an accuracy of only about 80.0%; although it effectively mitigates malicious attacks, improving model performance remains future work.
Anh et al. [56] employed PhoBERT, a lightweight Transformer architecture, for SMS spam detection while preserving privacy through federated learning. They applied different aggregation algorithms to Vietnamese and English messages and experimented on both IID and non-IID data distributions, with each aggregation algorithm showing highly competitive performance. Although the federated approach matched the classification capability of centralized training, the model was only a lightweight Transformer without deeper integration of other neural networks, and the feature representations extracted by a single Transformer are inherently limited for complex semantics. Moreover, the dataset was relatively small and predominantly Vietnamese, which may limit the generalizability of this lightweight framework.
Table 1 compares the advantages and disadvantages of the methods discussed above, grouping them into non-federated approaches [23,50,51] and federated learning techniques [47,52,54,55,56]. Non-federated detection methods inherently lack the capacity to safeguard user privacy, whereas federated learning, while better protecting email data, introduces its own constraints. A comparative evaluation of methodological performance underscores that spam detection must respect the private nature of email communications, warranting the adoption of federated learning for strengthened privacy preservation. Nevertheless, directly transplanting conventional methods into a federated learning framework substantially degrades performance relative to their centralized counterparts. In response, a key contribution of this work is the strategic integration of previously effective neural architectures, such as BiGRU and BiLSTM, to harness their complementary representational strengths. Empirical comparisons demonstrate that the proposed hybrid model yields competitive advantages over existing approaches.
6. Conclusions
The detrimental effects of spam have been thoroughly documented by reputable organizations. Despite advances in current filtering technologies, conventional detection methods still exhibit considerable deficiencies in privacy protection. This study presents FPW-BC, an email detection model that combines a federated learning framework with feature fusion mechanisms. The fundamental innovation lies in exploiting federated learning's property of "transmitting only model parameters without disclosing raw data" to guarantee privacy protection. FPW-BC establishes a multi-tiered fusion architecture: the BiLSTM module intensifies the model's focus on critical semantic information, the CNN module augments the extraction of local feature-space information, and the output features of both modules are fused to attain accurate classification. Experimental results indicate that the model markedly surpasses contemporary federated learning-based detection techniques on six prominent public datasets: CEAS, Enron, Ling, Phishingemail, Spamemail, and Fakephishing, attaining 99.78% accuracy on Fakephishing and 99.34% on CEAS.
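For intuition, the two-branch fusion idea can be caricatured in plain numpy. This is only an illustrative stand-in, not FPW-BC itself: a mean-pooled summary plays the role of the BiLSTM's contextual features, a max-pooled sliding-window average stands in for the CNN's local n-gram features, and the two are concatenated before a logistic output. All function names, shapes, and values here are hypothetical.

```python
import numpy as np

def contextual_branch(emb):
    """Stand-in for the BiLSTM branch: a sequence-level context summary."""
    return emb.mean(axis=0)

def local_branch(emb, window=3):
    """Stand-in for the CNN branch: max-pooled sliding-window features."""
    pooled = [emb[i:i + window].mean(axis=0)
              for i in range(len(emb) - window + 1)]
    return np.max(pooled, axis=0)

def fused_spam_score(emb, w):
    """Concatenate both branches, then apply a logistic classifier head."""
    feats = np.concatenate([contextual_branch(emb), local_branch(emb)])
    return 1.0 / (1.0 + np.exp(-feats @ w))

# Toy "email": 4 tokens with 3-dimensional embeddings (hypothetical values).
emb = np.arange(12, dtype=float).reshape(4, 3)
score = fused_spam_score(emb, np.zeros(6))  # untrained head -> 0.5
```

The point of the concatenation step is that the classifier head sees both global context and local pattern evidence at once, rather than either branch alone.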
In practical settings, non-IID data and significant heterogeneity are common, where data distributions across clients exhibit considerable divergence and may even demonstrate severe class or quantity imbalance. Data imbalance constitutes a major practical obstacle, as many methods struggle to achieve robust spam detection performance under such conditions. Moving forward, our work will prioritize tackling data heterogeneity in federated learning through the adoption of enhanced federated optimization algorithms, multi-feature fusion mechanisms, and refined local aggregation strategies, with the goal of alleviating the current limitations of models in these respects.
Although federated learning mitigates privacy risks during parameter uploading, potential data leakage remains during model update and aggregation phases. For instance, when multiple participants submit updates to a central server, adversaries may infer sensitive data characteristics by monitoring model updates or analyzing variations across participants’ updates to deduce statistical properties of local datasets. Such information can be exploited in reconstruction attacks to recover private training data.
To counter these threats, differential privacy can be applied to inject carefully calibrated noise into the output, preventing attackers from inferring individual information or data features through output analysis. Alternatively, homomorphic encryption enables computation on encrypted data without decryption, thereby obscuring model update patterns and protecting data privacy during processing. Given the critical importance of data privacy in today’s networked society, enhancing privacy preservation remains a key direction for our future work.
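The differential-privacy route can be sketched with the standard clip-then-add-Gaussian-noise recipe. The parameter names `clip_norm` and `noise_multiplier` follow common DP-SGD usage and the values are purely illustrative; a real deployment would also need a privacy accountant to track the resulting epsilon.

```python
import numpy as np

def dp_sanitize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Sanitize one client update before it leaves the device: clip its L2
    norm to bound any single client's influence, then add calibrated
    Gaussian noise so the server cannot reconstruct local training data."""
    rng = rng if rng is not None else np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# With the noise disabled, only the clipping is visible: [3, 4] has norm 5,
# so the update is scaled down to [0.6, 0.8] before transmission.
sanitized = dp_sanitize_update(np.array([3.0, 4.0]), noise_multiplier=0.0)
```

Homomorphic encryption, by contrast, would leave the update values exact but encrypted, trading noise for computational cost.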
During the process of privacy preservation, the challenge of data heterogeneity inevitably persists. While addressing such heterogeneity, it becomes imperative to simultaneously maintain the usability and privacy of the data. Therefore, integrating privacy protection mechanisms into solutions for data heterogeneity is likely to emerge as a critical trend. In subsequent work, we will focus on collaborative multi-model design strategies that enhance detection accuracy while ensuring privacy security. This approach aims to effectively mitigate data heterogeneity and promote the dual advancement of both privacy and efficiency in spam detection technology.