Systematic Review

Federated Learning for Anomaly Detection: A Systematic Review on Scalability, Adaptability, and Benchmarking Framework

by
Le-Hang Lim
,
Lee-Yeng Ong
* and
Meng-Chew Leow
Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia
*
Author to whom correspondence should be addressed.
Future Internet 2025, 17(8), 375; https://doi.org/10.3390/fi17080375
Submission received: 23 June 2025 / Revised: 23 July 2025 / Accepted: 12 August 2025 / Published: 18 August 2025

Abstract

Anomaly detection plays an increasingly important role in maintaining the stability and reliability of modern distributed systems. Federated Learning (FL) is an emerging method that shows strong potential in enabling anomaly detection across decentralised environments. However, several crucial research challenges remain unresolved, such as ensuring scalability, adaptability to dynamic server clusters, and the development of standardised evaluation frameworks for FL. This review aims to address these research gaps through a comprehensive analysis of existing studies. In this paper, a systematic review is conducted covering three main aspects of the application of FL in anomaly detection: the impact of communication overhead on scalability and real-time performance, the adaptability of FL frameworks to dynamic server clusters, and the key components required for a standardised benchmarking framework of FL-based anomaly detection. A total of 43 relevant articles, published between 2020 and 2025, were selected from IEEE Xplore, Scopus, and ArXiv. The research findings highlight the potential of asynchronous updates and selective update mechanisms in improving FL’s real-time performance and scalability. This review primarily focuses on anomaly detection tasks in distributed system environments, such as network traffic analysis, IoT devices, and industrial monitoring, rather than domains like computer vision or financial fraud detection. While FL frameworks can handle dynamic client changes, the problem of data heterogeneity among the clients remains a significant obstacle that affects the model convergence speed. Moreover, the lack of a unified benchmarking framework to evaluate the performance of FL in anomaly detection poses a challenge to fair comparisons among the experimental results.

1. Introduction

Anomaly detection plays a crucial role in maintaining the stability and reliability of modern distributed systems. Federated Learning (FL), as an emerging privacy-preserving machine learning paradigm, shows strong potential in enabling anomaly detection across decentralised environments. FL supports collaborative model training across distributed participants without the need to share raw data. Instead, only the trained model parameters are transmitted to a central server for aggregation, thus preserving data privacy and reducing communication risks [1]. This method is particularly effective in scenarios involving large-scale and dynamic client populations, as it allows each participant to remain autonomous while still contributing to a globally improved model [2]. However, several key challenges remain unresolved, such as ensuring scalability, adapting to dynamic server clusters, and developing standardised evaluation frameworks for FL-based anomaly detection. These challenges must be addressed to fully realise the potential of FL in real-world applications.
Most of the traditional anomaly detection methods are rule-based, relying on certain predefined rules, thresholds, or statistical principles to detect anomalies [3]. While this method is easy to implement, its limited flexibility poses challenges in handling complex scenarios, particularly those involving high-dimensional or non-linear data. Moreover, traditional methods often fail to detect early-stage or subtle anomalies. With the advancement of artificial intelligence, various machine learning methods—such as Support Vector Machines (SVMs) [4], Random Forests [5], and K-means clustering [6]—have been widely adopted for anomaly detection.
Although these methods have shown promising performance, they are typically deployed in centralised settings that require the aggregation of all data at a single location, leading to increased privacy risks and higher computational demands. To address these concerns, FL has been applied to anomaly detection, enabling decentralised model training across multiple clients without requiring raw data to be shared [2]. Instead, only model parameters are exchanged, preserving data privacy and client autonomy. Despite these advantages, FL also introduces new challenges. Ensuring scalability, maintaining communication efficiency, and enhancing adaptability are particularly critical for FL-based anomaly detection. Additionally, there is a lack of standardised benchmarking frameworks, making it difficult to fairly and consistently evaluate the performance of FL methods across different studies.
Most of the existing research has focused on comparing centralised machine learning methods with FL in terms of training time and accuracy [7,8]. In general, the results suggest that models trained using the FL framework are able to maintain high anomaly detection accuracy while significantly reducing training time compared to centralised methods [9]. This efficiency stems from FL’s ability to distribute model training across multiple participating nodes, enhancing computational performance. However, studies also show that centralised machine learning models consistently achieve slightly higher accuracy than FL. This outcome is expected because centralised models have direct access to the complete dataset, whereas FL models rely on local data and aggregated parameters, limiting their exposure to the full data distribution [10]. Despite this trade-off, FL demonstrates strong potential for anomaly detection by balancing accuracy with computational efficiency. Nevertheless, several practical considerations must be addressed when applying FL in real-world settings. Firstly, scalability is a significant challenge: how does FL maintain its performance and efficiency as the number of participants increases? Controlling communication overhead while maintaining the performance of an FL system as the number of participating clients grows is a major obstacle. Secondly, the adaptability of FL to dynamic server environments is also an important feature: if servers frequently join and leave, can FL adapt to such a situation without affecting model performance? Finally, there is still no standardised benchmarking framework for FL-based anomaly detection; existing research often uses different experiment setups and performance metrics, making it difficult to compare results consistently.
To the best of our knowledge, five studies have reviewed FL for anomaly detection across various domains. However, some research gaps remain in these existing review papers, which limit their comprehensiveness and applicability, specifically in the context of scalability, adaptability, and the completeness of benchmarking framework design.
Paper [11] presents a comprehensive survey of FL-based intrusion detection systems, especially in IoT, IIoT, IoV, and other distributed environments. It partially addresses the adaptability of FL in dynamic environments through discussions on asynchronous updates and hierarchical FL. However, it lacks discussion on the characteristics of a formalised or unified benchmarking framework.
Paper [12] focuses on reviewing the benefits, architectures, and limitations of existing FL-based intrusion detection systems. However, it does not clearly discuss the adaptability of FL to dynamic server clusters. Moreover, it does not provide insight into developing a standardised benchmarking framework for FL in anomaly detection.
Paper [13] focuses on architectures, deployment strategies, and future improvements of FL-based intrusion detection systems. It partially discusses dynamic and heterogeneous client environments, which is indirectly related to the adaptability of FL in dynamic environments. However, it does not consider dynamic server-cluster reconfiguration, such as the real-time addition or removal of nodes. In addition, this paper does not formulate or analyse a standard benchmarking framework covering unified datasets, evaluation protocols, or performance baselines for FL-based anomaly detection.
A systematic review of anomaly detection on non-IID (non-independent and identically distributed) data across various domains is provided in paper [14]. While comprehensive in its scope of non-IID challenges, this paper does not place emphasis on addressing the problem of communication overhead. Obstacles such as scalability and real-time performance concerns, which are crucial in FL environments, were not explored. Additionally, this review does not cover adaptation mechanisms for dynamic server infrastructures in FL-based anomaly detection.
Paper [15] highlights the application of FL in smart-building environments, covering use cases like anomaly detection, energy prediction, thermal comfort, and healthcare. It discusses concerns such as data privacy, communication overhead, and heterogeneity in smart-building IoT systems. Although the paper identifies important components to build a benchmarking framework such as evaluation metrics and datasets, it does not deeply analyse what makes these components important for a standardised benchmarking framework.
This paper aims to address the aforementioned research gaps in the existing review papers. In this paper, a systematic review is conducted covering three main aspects of the application of FL in anomaly detection: the impact of communication overhead on scalability and real-time performance, the adaptability of FL frameworks to dynamic server clusters, and the key components required for a standardised benchmarking framework of FL-based anomaly detection. The contributions of this systematic review can be summarised as follows:
  • Examine the impact of communication overhead on scalability and real-time performance and provide possible solutions.
  • Analyse the adaptation of FL frameworks to dynamic server clusters.
  • Identify key components for a standardised benchmarking framework to evaluate FL in anomaly detection.
The organisation of this paper is as follows: Section 1 provides a brief introduction and overview of this systematic review. Section 2 provides an overview of FL for anomaly detection. Section 3 discusses the methodology for selecting and analysing research papers. Section 4 shows the key findings obtained from selected research papers and a discussion. Section 5 lists the limitations of this systematic review paper. Section 6 is the conclusion, which summarises all of the important points in this report and outlines the future work.

2. Overview of FL

McMahan et al. first introduced the concept of Federated Learning (FL) in 2016 [16]. The idea of FL is to connect loosely coupled clients through a central server, allowing them to learn and train together to solve a common problem. FL is a decentralised machine learning paradigm composed of multiple clients and a central server. Each client has its own independent dataset and machine learning model. Although the clients train a global model together, they do not share their raw data. Instead, at the end of each training round, each client uploads its learning outcomes in the form of trained parameters, such as gradients and weights [16]. The central server aggregates the parameters from all clients into global parameters using algorithms like FedAvg, and the updated global model parameters are then sent back to all clients. This iterative process distributes the training workload and enables collaboration between clients for model training while preserving data independence and reducing the need for direct data sharing.
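As a concrete illustration, the FedAvg aggregation rule described above can be sketched in a few lines. This is a minimal sketch over flat parameter vectors, not the implementation of any specific FL framework; the function name `fedavg` and the toy client values are our own.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Aggregate client parameter vectors, weighting each client
    by the size of its local dataset (the FedAvg rule)."""
    total = sum(client_sizes)
    global_params = np.zeros_like(client_params[0])
    for params, n in zip(client_params, client_sizes):
        global_params += (n / total) * params
    return global_params

# Three clients with different amounts of local data: the client with
# more samples pulls the global model further toward its update.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 10, 20]
print(fedavg(clients, sizes))  # → [3.5 4.5]
```

In a full FL round, each client would first train locally and the server would broadcast the aggregated result back before the next round begins.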
FL is better suited for anomaly detection tasks than traditional centralised machine learning methods. Firstly, traditional centralised machine learning relies on a single central server to collect and store data, which raises concerns about data privacy, data breaches, and unauthorised access [17]. In contrast, FL performs model training locally without sharing raw data [18]. Each client only needs to transmit model parameters after training, ensuring strong data protection while maintaining good model performance. Moreover, the scalability of a centralised model is limited because a heavy computational burden and workload are placed on a single central server. FL distributes model training across multiple clients, reducing the computational burden on the central server and significantly improving computational efficiency, especially in large-scale server infrastructures. Furthermore, because training is carried out locally, there is no need to transfer large amounts of raw data between clients and central servers, which reduces bandwidth requirements and lowers the cost of data communication. With these strengths, FL is emerging as a promising solution for anomaly detection.
Various FL methods are being explored for anomaly detection. The common FL anomaly detection methods include supervised, semi-supervised, and unsupervised methods. These methods can be distinguished by the existence of labels in the training data. In a supervised FL anomaly detection method, both the regular and anomaly data are well-labelled. In semi-supervised FL, only the regular data are used during the training phase. The data that deviate from the regular data are classified as anomalies. If the training data do not include labels, the task is handled using the unsupervised method [19].
Existing studies consistently show that FL is well-suited for anomaly detection tasks, especially unsupervised FL, which is able to learn from diverse datasets effectively and achieve strong performance through its collaborative learning feature. For instance, paper [20] showed that the FL-based model can achieve a lower false-positive rate (FPR) compared to the non-FL models. This observed improvement in FPR can be attributed to several characteristics of both the FL framework and the nature of IoT-based intrusion detection. First, FL enables local model training on device-specific data, allowing each client to capture fine-grained patterns of normal behaviour. This localised learning helps reduce the misclassification of benign variations as anomalies. Second, the use of unsupervised autoencoders in the FL setting facilitates the learning of accurate reconstructions of normal input without relying on noisy or incomplete labels, thereby reducing false alarms. Third, FL’s continual local training process supports model adaptability in dynamic environments, which helps maintain detection accuracy as device behaviour evolves. Finally, by preserving data granularity and avoiding centralised data aggregation bias, FL models can achieve better specificity and generalisation, especially in heterogeneous IoT systems.
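The reconstruction-error scheme described above can be sketched as follows. This is a generic illustration, not the exact method of [20]: we assume a trained autoencoder has already produced per-sample reconstruction errors, and a detection threshold is calibrated on errors from normal data only; the quantile value and the error figures are hypothetical.

```python
import numpy as np

def fit_threshold(errors_on_normal, quantile=0.99):
    """Calibrate an anomaly threshold from reconstruction errors
    measured on normal (benign) samples only."""
    return np.quantile(errors_on_normal, quantile)

def detect(errors, threshold):
    """Flag samples whose reconstruction error exceeds the threshold."""
    return errors > threshold

# Hypothetical reconstruction errors: normal traffic reconstructs well
# (small error); an anomalous sample reconstructs poorly (large error).
normal_errors = np.array([0.01, 0.02, 0.015, 0.03, 0.025])
test_errors = np.array([0.02, 0.8, 0.01])
thr = fit_threshold(normal_errors)
print(detect(test_errors, thr))  # only the 0.8 sample is flagged
```

Because calibration uses only local benign data, each client can tune its own threshold to device-specific normal behaviour, which is one reason localised training can lower the false-positive rate.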
Paper [21] also confirmed the feasibility of collaborative learning methods of FL in anomaly detection tasks. However, FL still has some limitations. First, the performance of the FL model degraded when the number of clients surpassed a certain threshold, limiting the scalability of the system. This is because as the number of clients increases, the communication cost and computational overhead also increase. This is a common challenge faced in FL research, as shown in papers [19,22,23,24], in which all of these studies encountered problems related to computational overhead. This scalability constraint raises concerns about real-time performance and dynamic client management in FL environments. In addition, the problem of data heterogeneity, specifically the non-independent and identically distributed (non-IID) data, has a negative effect on the performance of the FL-based model, where the authors of papers [25,26] faced similar problems in their studies. Additionally, most of the existing studies lack a standardised benchmarking framework for evaluating the performance of FL-based anomaly detection models. A standardised benchmarking framework is essential for fair comparisons across different datasets, models, or experiment setups, allowing researchers to assess the experiment results objectively. Table 1 summarises various studies that have applied supervised, semi-supervised, and unsupervised Federated Learning (FL) for anomaly detection. These studies provide valuable insights into the feasibility and effectiveness of FL, while also underlining the concerns associated with FL implementation such as computational overhead and data heterogeneity.

3. Materials and Methods

This systematic review follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology to ensure transparency and reproducibility in the review process. This methodology includes four main stages: identification, screening, eligibility, and inclusion, as shown in Figure 1.

3.1. Research Questions

Three research questions are formulated to achieve the objective of this review paper: RQ1 examines the impact of communication overhead on the scalability and real-time performance of FL, RQ2 investigates existing research on adapting FL frameworks to dynamic server clusters, and RQ3 identifies the key components required to develop a standardised benchmarking framework for evaluating FL-based anomaly detection.
  • RQ1—How does communication overhead impact the scalability and real-time performance of Federated Learning for anomaly detection in distributed server environments?
  • RQ2—How can a Federated Learning framework adapt to dynamic server clusters?
  • RQ3—What key components are required to develop a standardised benchmarking framework for evaluating the performance of Federated Learning anomaly detection?

3.2. Data Search Strategies

A good search strategy is essential to obtain high-quality literature. In this stage, the sources of the study and the keywords used for searching the articles are determined. All selected articles are in English and were obtained from the digital libraries IEEE Xplore, Scopus, and ArXiv. In addition, keywords derived from the RQs are used to search the databases for relevant literature. Boolean operators (“AND” and “OR”) are utilised to construct the search strings: “AND” narrows the search to articles that combine the core concepts, while “OR” broadens it to include articles containing at least one of the specified terms. Table 2 shows the defined keywords and the corresponding Boolean search strings employed in this study. A total of 1132 papers were collected in the first round of searching.

3.3. Paper Selection

There were a total of 1132 papers collected initially, using “Federated Learning” and “Anomaly Detection” as the search keywords. After that, these 1132 papers went through a filtering process based on the predefined selection criteria shown in Table 3. First, the papers must be written in English and published between 2020 and 2025. Then, only journal and conference papers were considered, while other publication types like white papers, theses, and dissertations were excluded. Papers were screened based on their titles to determine their relevance to each RQ using the corresponding keywords. In this filtering stage, 385 papers were selected: 95 papers related to RQ1, 184 to RQ2, and 106 to RQ3. Of these 385 papers, 193 were from Scopus, 162 from IEEE, and 30 from ArXiv. After that, the papers were further analysed based on their abstracts to ensure that they addressed at least one of the RQs, and papers duplicated across different digital libraries were removed. The references in the selected papers were also investigated, and the search strategy was applied to them. After this final filtering step, 43 papers were selected: 31 from Scopus, 4 from IEEE, and 8 from ArXiv. Among these 43 selected papers, 11 address RQ1, 19 address RQ2, and 13 address RQ3. Table 4 presents a detailed breakdown of the number of papers retained after each filtering stage.

3.4. Data Extraction and Synthesis

After selecting the relevant literature, the next step is to extract the information needed to answer the RQs. Firstly, the abstracts of the selected papers are reviewed to identify the core technological contribution in relation to each RQ. Data charting is then performed; it plays a critical role in enhancing the transparency of the review process and facilitating systematic analysis. In this review, a structured approach is applied to data charting to extract the important information from the selected papers. The results of the data charting are presented in Table 5, Table 6, Table 7, Table 8 and Table 9 to support different aspects of the RQs. Table 5 shows bibliographic and classification details, including fields such as title, publication year, publication type, source, and the relevant RQ addressed. Table 6 summarises the methods proposed by each study to mitigate communication overhead in FL-based anomaly detection. Table 7 lists the methods used to adapt FL frameworks to dynamic server clusters. Table 8 introduces the characteristics of the benchmarking datasets. Table 9 highlights the experiment setups applied in the reviewed studies, including information such as dataset, data type, model used, FL methods, and FL aggregation algorithms. With these tables, a well-structured synthesis and comparative analysis across studies is achieved, significantly enhancing the consistency and completeness of the answers to the RQs. After collecting the required information, the extracted insights are synthesised to construct a comprehensive understanding of the addressed challenges and proposed solutions, and then categorised to answer the RQs. This synthesis builds a solid foundation for the next section, which presents the results of this review.

4. Results and Discussion

This section discusses the results of the review. After applying the research methodology, a total of 43 papers were selected based on their relevance to the RQs. Table 5 shows detailed information about the selected articles, including the paper source, title, type, publication year, and corresponding RQ. Specifically, 11 papers related to RQ1 discuss the impact of communication overhead on scalability and real-time performance. RQ2, which concerns the adaptability of FL frameworks to dynamic server clusters, is addressed by 19 selected papers. The remaining 13 papers focus on the key components required for a standardised benchmarking framework of FL-based anomaly detection. Figure 2 illustrates the number of papers related to each RQ and their publication years. The increasing number of studies reflects the growing maturity of FL technology over the years. Figure 3 categorises the selected papers by publication type, comprising 17 conference papers, 18 journal papers, and 8 preprints sourced from the open-access repository ArXiv.
Table 5. Selected research articles.
No. | Database Searched | Title | Publication Type | Publication Year | RQ | Ref.
1 | Scopus | Anomaly Detection from Distributed Data Sources via Federated Learning | Conference | 2022 | RQ1 | [19]
2 | Scopus | FedMSE: Semi-supervised federated learning approach for IoT network intrusion detection | Journal | 2025 | RQ1 | [24]
3 | Scopus | FedTADBench: Federated Time-series Anomaly Detection Benchmark | Conference | 2022 | RQ1 | [28]
4 | ArXiv | Federated Semi-Supervised and Semi-Asynchronous Learning for Anomaly Detection in IoT Networks | Journal | 2023 | RQ1 | [29]
5 | Scopus | Communication-Efficient Federated Learning for Network Traffic Anomaly Detection | Conference | 2023 | RQ1 | [30]
6 | Scopus | Adaptive Hierarchical GHSOM with Federated Learning for Context-Aware Anomaly Detection in IoT Networks | Conference | 2024 | RQ1 | [31]
7 | IEEE | Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications | Journal | 2025 | RQ1 | [32]
8 | ArXiv | TemporalFED: Detecting Cyberattacks in Industrial Time-Series Data Using Decentralized Federated Learning | Journal | 2023 | RQ1 | [33]
9 | IEEE | Quantized Distributed Federated Learning for Industrial Internet of Things | Journal | 2021 | RQ1 | [34]
10 | Scopus | Decentralized Federated Learning for Industrial IoT With Deep Echo State Networks | Journal | 2023 | RQ1 | [35]
11 | Scopus | FedSA: A Semi-Asynchronous Federated Learning Mechanism in Heterogeneous Edge Computing | Journal | 2021 | RQ1 | [36]
12 | IEEE | Dynamic Clustering in Federated Learning | Conference | 2021 | RQ2 | [37]
13 | Scopus | Federated Learning in Dynamic and Heterogeneous Environments: Advantages, Performances, and Privacy Problems | Journal | 2024 | RQ2 | [38]
14 | Scopus | DCFL: Dynamic Clustered Federated Learning under Differential Privacy Settings | Conference | 2023 | RQ2 | [39]
15 | Scopus | Fed-RAC: Resource-Aware Clustering for Tackling Heterogeneity of Participants in Federated Learning | Journal | 2024 | RQ2 | [40]
16 | Scopus | Highlight Every Step: Knowledge Distillation via Collaborative Teaching | Journal | 2022 | RQ2 | [41]
17 | Scopus | A dynamic adaptive iterative clustered federated learning scheme | Journal | 2023 | RQ2 | [42]
18 | Scopus | Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints | Journal | 2021 | RQ2 | [43]
19 | ArXiv | FedAC: An Adaptive Clustered Federated Learning Framework for Heterogeneous Data | Journal | 2024 | RQ2 | [44]
20 | Scopus | FedGroup: Efficient Federated Learning via Decomposed Similarity-Based Clustering | Conference | 2021 | RQ2 | [45]
21 | Scopus | Multi-center federated learning: clients clustering for better personalization | Journal | 2023 | RQ2 | [46]
22 | ArXiv | Towards Client-Driven Federated Learning | Journal | 2024 | RQ2 | [47]
23 | ArXiv | Three Approaches for Personalization with Applications to Federated Learning | Journal | 2020 | RQ2 | [48]
24 | Scopus | FedSoft: Soft Clustered Federated Learning with Proximal Local Updating | Conference | 2022 | RQ2 | [49]
25 | ArXiv | IP-FL: Incentivized and Personalized Federated Learning | Journal | 2024 | RQ2 | [50]
26 | Scopus | An Efficient Framework for Clustered Federated Learning | Journal | 2022 | RQ2 | [51]
27 | Scopus | Temporal Adaptive Clustering for Heterogeneous Clients in Federated Learning | Conference | 2024 | RQ2 | [52]
28 | Scopus | Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach | Conference | 2023 | RQ2 | [53]
29 | Scopus | Automated Collaborator Selection for Federated Learning with Multi-armed Bandit Agents | Conference | 2021 | RQ2 | [54]
30 | Scopus | Multi-Armed Bandit-Based Client Scheduling for Federated Learning | Journal | 2020 | RQ2 | [55]
31 | Scopus | FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data | Conference | 2024 | RQ3 | [56]
32 | ArXiv | Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset | Journal | 2024 | RQ3 | [57]
33 | Scopus | Federated Learning for Network Anomaly Detection in a Distributed Industrial Environment | Conference | 2023 | RQ3 | [58]
34 | Scopus | Performance Evaluation of Federated Learning for Anomaly Network Detection | Conference | 2023 | RQ3 | [59]
35 | Scopus | Anomaly Detection via Federated Learning | Conference | 2023 | RQ3 | [60]
36 | Scopus | Privacy-Preserving Federated Learning-Based Intrusion Detection Technique for Cyber-Physical Systems | Journal | 2024 | RQ3 | [61]
37 | Scopus | Intrusion Detection Approach for Industrial Internet of Things Traffic Using Deep Recurrent Reinforcement Learning Assisted Federated Learning | Journal | 2025 | RQ3 | [62]
38 | Scopus | A Federated Learning Approach for Efficient Anomaly Detection in Electric Power Steering Systems | Journal | 2024 | RQ3 | [63]
39 | Scopus | CAFNet: Compressed Autoencoder-based Federated Network for Anomaly Detection | Conference | 2023 | RQ3 | [64]
40 | Scopus | Federated Learning for Cloud and Edge Security: A Systematic Review of Challenges and AI Opportunities | Journal | 2025 | RQ3 | [65]
41 | ArXiv | UniFed: All-in-One Federated Learning Platform to Unify Open-Source Frameworks | Journal | 2022 | RQ3 | [66]
42 | Scopus | Trust-Based Anomaly Detection in Federated Edge Learning | Conference | 2024 | RQ3 | [67]
43 | Scopus | Fed-ANIDS: Federated learning for anomaly-based network intrusion detection systems | Journal | 2023 | RQ3 | [68]

4.1. Research Question 1—How Does Communication Overhead Impact the Scalability and Real-Time Performance of Federated Learning for Anomaly Detection in Distributed Server Environments?

As discussed above, FL shows strong potential for anomaly detection in distributed environments due to its ability to protect data privacy and enhance system robustness by distributing the training process. However, to fully realise FL’s potential in anomaly detection tasks, scalability and real-time performance are challenges that need to be addressed, both of which are significantly constrained by communication overhead. Real-time performance is crucial for anomaly detection systems, but an increasing number of clients often leads to higher communication latency, which can hinder the system’s efficiency.
Communication overhead refers to the computational resources, bandwidth, and time required for data transmission in a distributed learning system. Since FL relies on the continuous transmission of model parameters between clients and the central server, having more clients increases the demand for bandwidth and computing power. Studies [19,24] show that an increase in the number of participating clients negatively impacts the performance of FL. For example, in [19], FL models performed well with only 15 clients, but as the client count increased to 30, their effectiveness dropped significantly. Similarly, in [24], the mean accuracy of the model declined as the number of gateways increased. These findings show that as communication costs and computational resource requirements grow, the system may become overloaded, leading to a decline in model performance.
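A back-of-envelope calculation illustrates why overhead scales with the client count. This sketch assumes a simple synchronous round in which every client uploads its full model and downloads the new global model; the 5 MB model size is a hypothetical figure, not taken from the reviewed studies.

```python
def per_round_traffic(num_clients, model_bytes):
    """Approximate communication cost of one synchronous FL round:
    each client uploads its model update and downloads the new
    global model, so traffic grows linearly with the client count."""
    return 2 * num_clients * model_bytes

# Hypothetical 5 MB model, mirroring the 15-client vs 30-client
# comparison discussed above: doubling the clients doubles the traffic.
mb = 5 * 1024 * 1024
print(per_round_traffic(15, mb) / 2**20, "MiB")  # → 150.0 MiB
print(per_round_traffic(30, mb) / 2**20, "MiB")  # → 300.0 MiB
```

Real deployments add per-message protocol overhead and retransmissions, so this linear estimate is a lower bound on the actual cost.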
One of the primary contributors to communication overhead in FL is the choice of aggregation algorithm. More sophisticated aggregation methods, such as MOON in [28] and FedS3A in [29], can improve model quality, but they often come with higher computational complexity and communication overhead. The real-time performance of the FL framework may then suffer, because a complicated aggregation method needs more time for model convergence, leading to high latency. In contrast, simpler algorithms such as FedAvg offer faster training times but may sacrifice accuracy [28,29].
Several strategies have been proposed to minimise communication overhead. The first is model compression and sparse updates, such as Singular-Value Decomposition (SVD) [30] or difference transmission [29], which reduce the size of parameter updates. For example, the eFedAD framework proposed in [30] utilises SVD to compress model parameters before transmission, minimising bandwidth usage by reducing the number of transmitted parameters and thereby lowering communication overhead. Similarly, the difference-transmission method applied in [29] enables clients to send only the difference between the local and global models in the form of a sparse matrix. This method reduces the amount of data that needs to be transmitted while maintaining model performance.
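The SVD-based compression idea can be sketched as follows. This is a generic low-rank approximation of a single weight matrix, not the exact eFedAD procedure; the matrix shape and the chosen rank are illustrative.

```python
import numpy as np

def compress(weight, rank):
    """Low-rank SVD approximation of a weight matrix: the client
    transmits the truncated factors instead of the full matrix."""
    U, s, Vt = np.linalg.svd(weight, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

def decompress(U, s, Vt):
    """Server-side reconstruction of the approximate weight matrix."""
    return (U * s) @ Vt

W = np.random.default_rng(0).normal(size=(64, 32))
U, s, Vt = compress(W, rank=8)
sent = U.size + s.size + Vt.size  # floats actually transmitted
full = W.size                     # floats in the uncompressed matrix
print(sent, full)  # → 776 2048, roughly a 2.6x reduction
```

The rank controls the trade-off: a smaller rank transmits fewer values but loses more of the update, so in practice it would be tuned against detection accuracy.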
Rather than transmitting model updates after every training iteration, the selective federated update mechanism applied in [31] allows clients to transmit model updates to the central server only when the detected anomalies surpass predefined thresholds. This mechanism reduces unnecessary bandwidth usage and the workload of the central server by filtering out trivial updates and prioritising meaningful ones. Additionally, the weighted aggregation mechanism applied in [30] assigns higher weights to clients with more diverse and novel parameters, filtering out highly similar client parameters so that the server learns from diverse data in every training round, thereby accelerating convergence.
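The threshold-triggered transmission rule can be sketched as a simple client-side check. The function name, scoring semantics, and `min_count` parameter below are hypothetical illustrations of the idea, not the mechanism from [31]:

```python
def should_transmit(anomaly_scores, threshold, min_count=1):
    """Decide whether a client uploads its update this round: transmit
    only when enough detected anomalies surpass the score threshold."""
    significant = sum(1 for s in anomaly_scores if s > threshold)
    return significant >= min_count

# A quiet round (all scores low) is filtered out; an anomalous one is sent.
quiet = should_transmit([0.10, 0.15, 0.20], threshold=0.8)   # False
alert = should_transmit([0.10, 0.95, 0.20], threshold=0.8)   # True
```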
Traditional aggregation algorithms such as FedAvg apply synchronous FL, where the central server aggregates parameters only after all clients complete their updates, introducing delays caused by slower participants. In contrast, ART-FL proposed in [32] leverages asynchronous model updates to enhance the efficiency of the FL system. Asynchronous updates allow the central server to aggregate each update as soon as it is received, improving real-time responsiveness and preventing stragglers from delaying the entire process.
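The core of asynchronous aggregation, applying each update as it arrives while discounting stale ones, can be sketched as follows. The staleness-discount rule and parameter names are illustrative assumptions, not taken from ART-FL [32]:

```python
import numpy as np

def async_update(global_model, client_model, client_round, server_round,
                 base_mix=0.5):
    """Blend one client update into the global model on arrival,
    shrinking the mixing weight for stale (late-arriving) updates."""
    staleness = server_round - client_round
    alpha = base_mix / (1 + staleness)  # older updates count less
    return (1 - alpha) * global_model + alpha * client_model

g = np.zeros(3)
# A fresh update (no staleness) moves the model halfway toward the client.
g = async_update(g, np.ones(3), client_round=5, server_round=5)
# A stale update (3 rounds old) has a much smaller effect.
g2 = async_update(g, np.full(3, 2.0), client_round=2, server_round=5)
```

Because no update waits for the slowest client, the server's model advances at the pace of the fastest participants.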
Refs. [33,34,35] explored Decentralised Federated Learning (DFL) and Semi-Decentralised Federated Learning (SDFL) to reduce communication overhead. Although FL is a distributed learning method that allows clients to train locally, it still needs a central server to handle updated client parameters. In contrast, DFL removes the need for a central server by enabling peer-to-peer communication, allowing clients to communicate with each other directly. SDFL introduces local aggregators, which are responsible for aggregating model updates from their neighbouring clients. The local communication characteristics of DFL and SDFL reduce the communication pressure on the central server as the number of clients increases and also avoid the single point of failure at the central server, which could otherwise lead to a complete system breakdown.
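The two-level aggregation that characterises SDFL can be sketched in a few lines. This is a simplified illustration under our own naming, not a specific protocol from [33,34,35]:

```python
import numpy as np

def sdfl_round(groups):
    """Semi-decentralised round: each local aggregator averages its own
    group's models, then the aggregators average among themselves, so
    no single central server handles every client directly."""
    local_aggregates = [np.mean(group, axis=0) for group in groups]
    return np.mean(local_aggregates, axis=0)

groups = [
    [np.array([1.0]), np.array([3.0])],   # clients served by aggregator A
    [np.array([5.0]), np.array([7.0])],   # clients served by aggregator B
]
model = sdfl_round(groups)  # local means 2.0 and 6.0 -> global 4.0
```

Each aggregator only ever talks to its neighbours and its peers, which is what relieves the central bottleneck as client counts grow.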
Next, the client-selection technique applied in [30] is also an effective way to reduce redundant communication. Clients are grouped into clusters based on the similarity of their data feature distributions, and each cluster selects only one client as its representative to transmit parameters. In addition, the adaptive learning rates applied in [29,36] balance the contribution of each client by dynamically adjusting each client's learning rate based on its participation frequency, reducing the need for frequent communication and thereby lowering communication overhead.
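The representative-selection idea can be sketched as follows: cluster clients by feature-distribution similarity and let only the member closest to each centroid transmit. The naive centroid seeding and distance rule below are our simplifications, not the clustering procedure from [30]:

```python
import numpy as np

def select_representatives(client_features, n_clusters):
    """Assign each client to the nearest of n_clusters seed centroids,
    then pick one representative per cluster (the member closest to
    the centroid) so only that client transmits its update."""
    clients = np.asarray(client_features, dtype=float)
    centroids = clients[:n_clusters]            # naive seeding for the sketch
    labels = np.argmin(
        np.linalg.norm(clients[:, None, :] - centroids[None, :, :], axis=2),
        axis=1,
    )
    reps = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if members.size:
            dists = np.linalg.norm(clients[members] - centroids[c], axis=1)
            reps.append(int(members[np.argmin(dists)]))
    return labels, reps

features = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
labels, reps = select_representatives(features, n_clusters=2)
# Clients 0 and 2 share a cluster, so only one of them needs to transmit.
```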
Although the reviewed studies report significant performance gains in communication efficiency and scalability, many of these results rely on idealised assumptions that may not hold in real-world FL deployments. For instance, scalability evaluations often assume fixed or limited numbers of participating clients, stable network conditions, uniform hardware capabilities, and relatively controlled data distributions. In practice, however, FL systems frequently encounter unpredictable client churn, variable bandwidth, and highly diverse data environments, particularly in edge and IoT settings. These mismatches between experimental conditions and real-world constraints may limit the generalisability of the reported results. Therefore, when interpreting scalability improvements, it is important to consider the underlying assumptions and assess whether the proposed solutions can sustain performance in heterogeneous and dynamic environments.
In summary, the scalability and real-time performance of FL are consistently restricted by communication overhead. The cost of transmitting frequent, large, or redundant updates becomes a critical bottleneck as client numbers increase. Solutions such as model compression, selective updates, asynchronous training, decentralised architectures, and intelligent client selection have shown promise in minimising this burden. Table 6 shows the approaches applied by the selected articles to mitigate communication overhead.
Table 6. Techniques used to mitigate communication overhead.
| Method | Core Technique | Strengths | Suitable Use Case | Source |
|---|---|---|---|---|
| Model Compression | Singular-Value Decomposition (SVD); sparse matrices | Reduces bandwidth; maintains performance | Bandwidth is limited; large-scale client environments | [29,30] |
| Selective Update | Threshold-based anomaly-triggered updates | Avoids unnecessary transmissions; prioritises critical updates | Anomalies are rare but high impact | [31] |
| Weighted Aggregation | Data novelty-based weighting | Speeds convergence; reduces redundant updates | Heterogeneous data across clients | [30] |
| Asynchronous Aggregation | Server updates without waiting for all clients | Reduces latency; avoids bottlenecks from slow clients | Clients vary in availability or speed | [32] |
| Decentralised FL (DFL) | Peer-to-peer (D2D); consensus protocols | Eliminates central server; highly scalable and robust | Central server is a bottleneck or failure risk | [33,34] |
| Semi-Decentralised FL (SDFL) | Local aggregators within client groups | Reduces load on central server | Large client networks with hierarchical topology | [33] |
| Client Clustering/Selection | Feature similarity-based representative selection | Reduces redundant communication; maintains data diversity | Data is clustered; client count is very high | [30] |
| Difference Transmission | Transmit delta between local and global models | Significant bandwidth savings; preserves update structure | Frequent model updates with small deltas | [29] |
| Adaptive Learning Rate | Participation frequency-based adjustment | Balances contributions; reduces overfitting from dominant clients | Clients participate unevenly or intermittently | [29,36] |

4.2. Research Question 2—How Can a Federated Learning Framework Adapt to Dynamic Server Clusters?

As mentioned in RQ1, the number of participating clients in an FL system is dynamic in real-world applications. Moreover, FL is designed to train a global model across a large number of decentralised and heterogeneous datasets, which makes efficient server allocation a critical task. To address these challenges, an advanced FL framework should incorporate a dynamic server cluster capable of adapting to changing client participation and data distribution. Dynamic server clustering in FL refers to the situation where the number of servers responsible for handling the model-update aggregation process changes over time [37]. For example, servers may be dynamically allocated to manage increased workloads and removed when facing resource constraints or failures. Furthermore, resource management strategies enable the underlying infrastructure that supports the server cluster, such as virtual machines, to dynamically reconfigure based on demand [38], thereby enhancing system performance and improving scalability.
One of the most critical obstacles in FL is the problem of data heterogeneity [37], where clients possess vastly different data distributions. This causes conflicting gradient updates, slows convergence, and leads to biased global models. Despite significant research efforts, data heterogeneity remains an unresolved challenge in FL-based anomaly detection. This is primarily because client data is inherently non-IID and often unobservable to the central server due to privacy constraints. Clients may exhibit imbalanced or sparse anomaly distributions, different feature spaces, or even entirely missing classes of data. In such cases, aligning local objectives with the global optimisation goal becomes extremely difficult. Additionally, the selective participation of clients in each training round further amplifies statistical biases, hindering stable convergence.
From a theoretical standpoint, these factors disrupt the foundational assumptions of many FL optimisation algorithms, such as FedAvg and FedProx, which assume data similarity across clients. As a result, convergence rates become unpredictable and often require additional constraints or heuristics to stabilise. In terms of generalisation, a model trained under non-IID conditions may overfit to dominant client patterns while failing to detect rare or unseen anomalies in minority clients. These limitations collectively pose significant barriers to achieving robust, generalisable anomaly detection in realistic federated settings. Fortunately, dynamic server clusters can mitigate the problem of data heterogeneity effectively by allocating the aggregation server to handle the client with similar data distribution and computational needs, optimising model updates, and reducing conflicts in the learning process. The dynamic server-clustering strategies can be broadly categorised into server-driven, client-driven, and adaptive learning-based frameworks. While these methods can improve local convergence and model relevance within clusters, they do not entirely eliminate the challenges associated with global model aggregation across diverse clusters. As such, dynamic clustering should be seen as a complementary technique within a broader set of solutions aimed at addressing non-IID data distributions in FL-based anomaly detection.
The first category of dynamic server-clustering strategies is server-driven clustering. Ref. [39] proposed a dynamic clustering algorithm called Dynamic Clustered Federated Learning (DCFL) that combines divergence measures, Euclidean distance, and affinity propagation, enabling the model to group clients with similar data distributions into the same cluster. The clustering structure updates automatically when a new client joins the system, and the new client is assigned to the closest cluster based on the distance metric. Similarly, a change in a client's data distribution automatically triggers updates to the clustering structure, ensuring its accuracy. Finally, the Dunn index is applied to evaluate clustering quality after each communication round. The experimental results also show that the proposed DCFL algorithm outperforms the FedAvg algorithm and the divisive clustering approach (CFL) in terms of accuracy.
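The distance-based assignment of a newly joined client, one ingredient of DCFL, can be sketched as follows (the affinity propagation and Dunn index steps are omitted; function and variable names are ours):

```python
import numpy as np

def assign_new_client(client_summary, cluster_centroids):
    """Assign a newly joined client to the closest existing cluster by
    Euclidean distance between distribution summaries."""
    dists = [np.linalg.norm(np.asarray(client_summary) - np.asarray(c))
             for c in cluster_centroids]
    return int(np.argmin(dists))

# Two existing clusters, summarised by their centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
cluster_id = assign_new_client([9.0, 11.0], centroids)  # joins cluster 1
```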
Similarly, ref. [40] proposed the Fed-RAC (Federated Learning with Resource-Aware Clustering) framework, which groups clients based on their resource capability and data distribution using K-means clustering. This framework uses a leader–follower technique that allows the high-resource cluster (leader) to guide the low-resource clusters (followers) by sharing logits and feature information to improve performance. Clusters are compacted or updated when their performance degrades, using the Dunn index for structure evaluation. The same leader–follower technique is also applied in [41].
Ref. [42] proposed an innovative Clustered Federated Learning (CFL) method called AICFL (Adaptive Iterative Clustered Federated Learning) that adjusts the cluster structure dynamically based on client participation and data loss. Clients are allowed to select the clusters where they perform best; new clusters are created if no match is found, and underperforming clusters are deleted. Ref. [43] further improves CFL by computing the cosine similarity of gradient updates. The server then splits clients into clusters iteratively using optimal bipartitioning algorithms, which reduces cross-cluster similarity and increases intra-cluster cohesion.
Refs. [44,45] used model similarity metrics for clustering. Ref. [44] proposed a framework called FedAC which uses low-rank cosine similarity (LrCos) and an Expectation–Maximization (EM)-like algorithm to adapt to dynamically changing data distributions. Moreover, the Euclidean Distance of Decomposed Cosine similarity (EDC) proposed in [45] decomposes model updates into low-rank components using Singular-Value Decomposition (SVD) before computing pairwise similarities. This decomposition reduces the complexity of the similarity calculations and improves clustering efficiency. Ref. [46] also adopts an EM-based framework but relies on L2 distance instead of cosine similarity.
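The low-rank similarity idea can be sketched as follows: project each model update onto its top singular components before comparing. This is a generic illustration under our own naming, not the exact LrCos or EDC formulation from [44,45]:

```python
import numpy as np

def lowrank_cosine_similarity(update_a, update_b, rank):
    """Cosine similarity between two model updates after truncating
    each to its top-`rank` SVD components."""
    def lowrank(M):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    a, b = lowrank(update_a).ravel(), lowrank(update_b).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8))
sim_same = lowrank_cosine_similarity(M, M, rank=2)        # ~ 1.0
sim_opposite = lowrank_cosine_similarity(M, -M, rank=2)   # ~ -1.0
```

Clients whose truncated updates point in similar directions would be grouped into the same cluster.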
The second category is client-driven clustering. A representative example is the Client-Driven Federated Learning (CDFL) proposed in [47], which shifts the control of FL from the server to the clients. Clients decide when to update their models based on changes in their local data distributions, while the server is responsible for estimating client distributions, updating cluster models dynamically, and generating personalised models for each client. The clients self-monitor their model performance and detect distribution shifts, deciding when to update their model. CDFL applies asynchronous communication, which allows clients to update their models without waiting for the server or other clients. The on-demand model upload technique enables clients to initiate updates based on their local needs. The server calculates importance weights to guide aggregation, preventing outdated models from negatively impacting cluster model updates.
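The client-side self-monitoring that triggers an update in CDFL-style frameworks can be sketched as a simple drift check. The rule, tolerance, and function names below are hypothetical illustrations of the idea, not the algorithm from [47]:

```python
def should_update(recent_scores, baseline_mean, tolerance=0.05):
    """Client-side check: request a model update only when recent local
    performance drifts away from its established baseline."""
    current = sum(recent_scores) / len(recent_scores)
    return abs(current - baseline_mean) > tolerance

# Stable performance: no upload needed. A drop signals distribution shift.
stable = should_update([0.91, 0.90, 0.92], baseline_mean=0.91)  # False
drift = should_update([0.78, 0.80, 0.79], baseline_mean=0.91)   # True
```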
Similarly, refs. [47,48,49] shift the control of FL from the central server to the clients, allowing clients to customise models to their local data. In [48], clients perform data interpolation by combining local and global data, controlling the weights placed on each. Clients also interpolate the local and global models to balance local and global knowledge. On the other hand, clients in [49] guide updates based on the importance weights of each cluster for their data. By utilising proximal updates in [49] and subsampling (Dapper) in [48], clients contribute to global training without excessive computation. In [50], clients can choose the cluster that aligns with their preferences and withdraw from training when the level of benefit (measured by PMA) does not meet their expectations. Similarly, ref. [51] lets clients choose clusters by evaluating the minimum local loss of each cluster without centralised control. These approaches emphasise autonomy and personalised learning in dynamic environments.
The third category is adaptive and learning-based clustering. The study of [37] achieves dynamic server clustering through a three-phased clustering framework: GAN-based clustering, cluster calibration (HypCluster), and divisive re-clustering. It adaptively groups clients without sharing raw data and continuously adjusts cluster boundaries based on performance and resource availability. Similarly, ref. [52] applied temporal-based clustering that can group the clients dynamically based on their temporal data patterns in their data distribution. The clients are clustered using K-means clustering, while Silhouette score is used to reassess clusters as data patterns evolve.
In [53], the authors proposed a Multi-Armed Bandit (MAB) framework to realise dynamic and personalised clustering of clients in FL. The core technique in this framework is the dUCB algorithm, which continuously updates reward estimates for each cluster based on the latest user data, so the system can adapt to network changes such as client join and leave events. Similarly, refs. [54,55] integrate FL with MAB algorithms to address dynamic environments. In [54], the proposed dynamic worker selection uses epsilon-greedy and UCB to eliminate clients that hinder the improvement of the model, rewarding accuracy improvement between rounds. This adaptive worker-pruning method helps the framework address the problem of data heterogeneity effectively. Ref. [55] focuses on client scheduling to minimise latency in wireless environments: CS-UCB handles ideal IID settings, while CS-UCB-Q uses virtual queues to enforce fairness under non-IID scenarios and dynamic client availability.
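The exploration–exploitation trade-off behind these MAB-based schemes can be sketched with plain UCB1. This generic version is our illustration, not the dUCB or CS-UCB variants from the cited papers:

```python
import math

def ucb_select(mean_rewards, counts, total_rounds, c=2.0):
    """Pick the arm (client or cluster) with the highest upper confidence
    bound; arms that have never been tried are explored first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # explore any arm that has never been selected
    scores = [mu + math.sqrt(c * math.log(total_rounds) / n)
              for mu, n in zip(mean_rewards, counts)]
    return scores.index(max(scores))

# Arm 1 has the best observed reward, but arm 2 is under-explored,
# so its confidence bonus makes it the selection this round.
choice = ucb_select([0.50, 0.70, 0.60], counts=[50, 50, 2], total_rounds=102)
```

As counts grow, the confidence bonus shrinks and selection converges toward the genuinely best-performing arms, which is how these schemes adapt to joining and leaving clients.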
In summary, dynamic server clustering in FL is essential for achieving scalability and robustness in real-world deployments. Server-driven methods (e.g., DCFL, Fed-RAC, AICFL) focus on clustering logic and centralised control, often relying on distance metrics and adaptive aggregation. Client-driven approaches (CDFL) emphasise autonomy and personalisation, suitable for highly heterogeneous environments. Learning-based frameworks (MABs, EM) add intelligence and adaptability, particularly in rapidly changing or uncertain network conditions. Table 7 below summarises the methods applied by the selected articles to achieve dynamic server clustering.
Table 7. Methods used for dynamic server clustering.
Category: Server-Driven

| Method/Paper | Key Mechanism | Adaptation Mechanism | Advantages |
|---|---|---|---|
| DCFL [39] | Affinity propagation; Dunn index | Client data-distribution changes trigger dynamic cluster restructuring | High clustering quality; supports heterogeneity |
| Fed-RAC [40,41] | K-means; knowledge distillation; Dunn index; leader–follower technique | High-resource clients act as leaders; inefficient clusters are compressed | Handles both data and resource heterogeneity; strong hierarchical structure |
| AICFL [42] | Self-cluster selection; cluster creation/deletion | Clients join/leave clusters based on loss evaluation | Highly adaptive; automatically maintains clustering structure |
| Improved CFL [43] | Cosine similarity of gradients; optimal bipartitioning | Gradient-based dynamic client reassignment | Improves convergence and model consistency |
| FedAC [44], EDC [45] | Cosine/SVD similarity; EM clustering | Real-time clustering based on model/gradient drift | Reduces noise; high clustering accuracy |
| [46] | L2 distance + EM | Parameter-distance-based dynamic reassignment | Simplified EM clustering variant |

Category: Client-Driven

| Method/Paper | Key Mechanism | Adaptation Mechanism | Advantages |
|---|---|---|---|
| CDFL [47] | Client-triggered updates; importance weighting | Clients initiate updates when local performance shifts | Reduces server load; improves personalisation |
| FedAMP [48], FedPCL [49] | Local–global model interpolation; adaptive weighting | Clients adjust participation and aggregation weights | Strong personalisation and robustness |
| [50] | Personalised accuracy-based exit mechanism | Clients withdraw when performance drops | Enhances overall model quality |
| [51] | Clients choose clusters with minimal local loss | Self-evaluation to find best-fitting cluster | Avoids poor matches in training |

Category: Adaptive and Learning-Based

| Method/Paper | Key Mechanism | Adaptation Mechanism | Advantages |
|---|---|---|---|
| GAN + HypCluster [37] | GAN-based data generation; hierarchical clustering | Multi-stage adaptation to data and resource variation | Supports label-free clustering; strong generalisation |
| Temporal K-Means [52] | Temporal reclustering + Silhouette score | Periodic re-clustering based on evolving data patterns | Handles long-term data changes |
| MAB + dUCB [53] | Dynamic Upper Confidence Bound | Reward-based client–cluster assignment | Highly adaptive; fast convergence |
| MAB Extensions [54,55] | ε-greedy, UCB variants | Selects high-contribution or low-latency clients | Reduces latency; increases fairness |

4.3. Research Question 3—What Key Components Are Required to Develop a Standardised Benchmarking Framework for Evaluating the Performance of Federated Learning Anomaly Detection?

Based on the research and literature above, FL can be considered a high-potential technology for anomaly detection. However, there is a lack of standardised benchmarking frameworks in the existing studies, making it difficult to compare the results of state-of-the-art methods. A unified benchmarking framework is important to guide future research and development of FL in anomaly detection by providing fair and reproducible evaluation and comparison across studies [56]. Even if such a unified benchmarking framework were established, its applicability must still be aligned with the specific characteristics of the anomaly detection task. For example, benchmarks used in time-series anomaly detection for industrial IoT may not be suitable for benchmarking image-based or transactional fraud detection systems, due to differences in data structure, anomaly frequency, and response-time requirements. Therefore, care must be taken to ensure that evaluation metrics and experimental conditions are chosen based on the nature and context of the target application. In some domains, such as healthcare or industrial control, the cost of false negatives may be far more critical than false positives, whereas the opposite may be true for spam filtering or fraud detection. A meaningful benchmark must reflect these application-specific trade-offs, rather than assuming a one-size-fits-all evaluation approach. Accordingly, the following content does not aim to prescribe a rigid set of components that a benchmarking framework must include. Instead, it draws upon observations from a wide range of existing studies to outline the common conditions and characteristics that a well-designed benchmarking framework for FL-based anomaly detection should ideally possess.
A benchmarking framework for Federated Learning (FL) commonly includes several key components. The first is the dataset, which can be categorised as IID (independent and identically distributed) or non-IID. Non-IID datasets better match real-world applications because the data distribution across participating clients often varies significantly in practical environments. Next, selecting a suitable FL architecture, such as FedAvg, FedProx, or FedSGD, is important; if the client environment is dynamic, asynchronous FL is a good option to consider. In addition, choosing the right anomaly detection model is a crucial step in framework design because different models have different strengths: for example, LSTM suits time-series analysis, while Isolation Forest excels at detecting outliers. Once the model architecture is built, evaluation metrics need to be defined to assess the performance of the proposed framework. Traditional detection metrics, such as accuracy, precision, recall, F1-score, and AUC, are important parts of performance evaluation. Additionally, FL-specific metrics, such as communication cost, model convergence speed, and client participation rate, should be designed based on experimental requirements. Finally, the experimental setup includes the division of training and test sets, as well as the design of comparative experiments (e.g., centralised vs. federated methods) to validate the effectiveness of the proposed solution. Each of these components is discussed in detail in the following paragraphs.
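The metric components above can be made concrete with a short sketch combining a traditional detection metric with an FL-specific communication-cost measure (the function names and the simple cost formula are ours, for illustration):

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, and F1 for binary anomaly labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def communication_cost(rounds, clients_per_round, update_bytes):
    """A simple FL-specific metric: total upload volume across training."""
    return rounds * clients_per_round * update_bytes

p, r, f1 = detection_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
cost = communication_cost(rounds=100, clients_per_round=10,
                          update_bytes=4_000_000)
```

Reporting both kinds of metric side by side is what allows an FL method that trades a little accuracy for much lower communication cost to be evaluated fairly.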

4.4. Dataset

To build a standardised benchmarking framework for federated anomaly detection, the dataset is one of the most important elements because model quality is highly reliant on data quality. Rather than merely choosing popular datasets, it is important to identify the essential characteristics that make a dataset suitable for benchmarking in this domain. These characteristics include realism, high dimensionality, temporal continuity, heterogeneity (non-IID nature), label availability, and scalability for distributed settings.
The first and most crucial characteristic of a benchmark dataset is its realism and relevance. A good benchmark dataset should mirror real-world environments. For example, the IBM Cloud dataset introduced in [57] is a strong example of a highly realistic dataset. It was collected from the IBM Cloud Console over 4.5 months and contains 39,365 rows and 117,448 columns of telemetry data, reflecting the complexity of modern cloud-monitoring environments. This high degree of realism makes it well suited for training and evaluating FL-based anomaly detection models under production-like conditions.
The second characteristic is high dimensionality and multivariate signals. A detailed and comprehensive dataset is typically high-dimensional, involving numerous metrics such as CPU, memory, API calls, and error logs. To ensure model robustness, a good benchmark dataset should have this multivariate structure. The Microsoft Cloud Monitoring dataset introduced in [57] partially meets these needs by providing moderately multivariate data with 67 time-series signals (averaging 3757 rows each) and timestamped metrics such as API latency and crash rates. This kind of dataset is well suited to lighter benchmarking tasks but is not sufficient for high-complexity anomaly detection.
The third important feature of a benchmarking dataset is its temporal structure and continuity. In real-world applications, anomalies often unfold over time, which highlights the importance of time-series continuity. The NAB dataset introduced in [57] exhibits this characteristic: it consists of real-world and synthetic time-series data with timestamped anomalies, designed for testing anomaly detection in real-time streaming applications. This type of temporal structure is essential for applying recurrent models such as LSTM or GRU in FL-based anomaly detection.
The heterogeneity and non-IID distribution of the dataset are also important considerations because data across clients is rarely identically distributed in federated settings. The Westermo network traffic dataset in [58], which targets network anomaly detection in a distributed industrial network, can be partitioned by device or subsystem to emulate FL setups. It contains various anomalies, such as cyber-attacks and switch misconfigurations, introduced in a simulated factory testbed. Similarly, the CICIDS2017 dataset introduced in [59] is a good example of a dataset that can be partitioned by attack type or source IP to simulate non-IID client environments. It is generated from realistic network traffic and is widely used as a benchmark for evaluating intrusion detection systems (IDS). CICIDS2017 contains various types of network attacks, such as web attacks, port scans, and brute-force attacks, with data labelled as normal or malicious, reflecting real-world scenarios.
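Such label-based non-IID partitioning can be sketched as follows: all records of a given attack type are routed to the same simulated client. The client mapping and record names below are hypothetical:

```python
from collections import defaultdict

def partition_by_label(records, labels, client_of_label):
    """Emulate a non-IID federated split by routing every record of a
    given label (e.g. attack type) to one designated client."""
    partitions = defaultdict(list)
    for record, label in zip(records, labels):
        partitions[client_of_label[label]].append(record)
    return dict(partitions)

records = ["flow1", "flow2", "flow3", "flow4"]
labels = ["PortScan", "BruteForce", "PortScan", "WebAttack"]
mapping = {"PortScan": "client_0", "BruteForce": "client_1",
           "WebAttack": "client_1"}
parts = partition_by_label(records, labels, mapping)
# client_0 sees only PortScan traffic; client_1 sees the other attacks.
```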
The fifth characteristic of a benchmarking dataset is label availability. The data must include accurate, diverse, and sufficiently numerous labelled anomalies. Good examples include the CICIDS2017 [59], NAB, and Microsoft datasets [57], which cover multiple types of network attacks and are well labelled as normal or malicious. These labels allow models to be trained in supervised or semi-supervised settings and evaluated across multiple anomaly types.
Lastly, a benchmarking dataset for FL-based anomaly detection should be scalable for distributed settings. Tabular datasets like NSL-KDD, Arrhythmia, and Thyroid, which were introduced in [56], can be split across features or class labels to represent multiple clients. This allows researchers to experiment with various client configurations, aiding scalability studies. Table 8 summarises the datasets discussed above, highlighting their domain, availability (public or private), and key benchmarking-relevant features such as temporal structure, dimensionality, label availability, and FL scalability.
Table 8. Example of benchmarking datasets.
| Dataset Name and Size | Domain | Availability | Temporal (Time Series) | Multivariate/High-Dimensional | Non-IID Partition Feasible | Labelled Anomalies | Real-World (vs. Simulated) | Scalable to FL Setting |
|---|---|---|---|---|---|---|---|---|
| IBM Cloud [57] (39,365 records, 117,448 features) | Cloud telemetry | Public | No, tabular snapshot | Very high-dimensional | Limited (needs engineering) | Not explicitly | Real-world | Limited due to size |
| NAB [57] (57 time series × ~6303 rows each) | Streaming/cloud ops | Public | Yes | Limited (univariate/mixed) | Time-based partitions | Yes | Mixed (real + synthetic) | Yes |
| Microsoft Cloud Monitoring [57] (67 time series × ~3757 rows each) | Cloud systems monitoring | Public | Yes | Moderate (67 metrics) | Can simulate clients | Yes | Real-world | Yes |
| Westermo ICS [58] (1.8 million packets → 48,657 flow records, 53 features) | Industrial networks/ICS | Public | Yes | Multivariate and high-dimensional | Per device/attack type | Yes | Simulated testbed | Yes |
| CICIDS2017 [59] (~2.8 million network flow records) | Network security/IDS | Public | Yes (flow-based) | Multivariate | Per attack/client | Yes | Realistic traffic | Yes |
| KDDCUP99 [56] (494,021 samples, 41 features); NSL-KDD [56] (148,517 samples, 41 features) | General network intrusion | Public | No, static | Low to moderate | Easily partitioned | Yes | Synthetic or outdated | Yes |
| Arrhythmia [56] (452 samples, 274 features); Thyroid [56] (3772 samples, 6 features) | Medical/bioinformatics | Public | No, tabular | Moderate | Class-based partitions | Yes | Out-of-domain | Yes |

4.5. Experiment Setup

A comprehensive and standardised experiment setup greatly facilitates reproducible and comparable benchmarking of FL models for anomaly detection. However, current FL-based anomaly detection research often lacks consistency in how experiments are designed and reported, leading to difficulties in interpreting results and drawing generalisable conclusions. In response, this section outlines the experimental setup components that must be clearly defined and standardised to achieve fair, reproducible, and comparable evaluations of FL anomaly detection methods. These components range from client simulation and communication protocols to training strategies and baseline definitions, representing the foundational building blocks of a benchmarking framework.
One of the key elements of any FL experiment is client simulation, which covers the number of clients, client heterogeneity, and client availability. First, the number of clients is important because it affects the scalability and convergence behaviour of FL models. Studies such as [60,61] include multiple clients ranging from IoT nodes to data centre devices. The number of clients should be configured according to the model's computational capability and resource availability to avoid excessive computational overhead. The second consideration is client heterogeneity: in model training, non-IID data distributions more closely resemble real-world scenarios, enhancing the model's adaptability to practical applications and making them a more appropriate choice for benchmarking [56,62]. Third, the number of clients changes dynamically over time in real-world applications, and some studies applied dynamic client selection to handle such situations [62]. However, these selection rules often differ across studies, making it hard to compare results; a standard way of simulating client availability should be defined and reported clearly.
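One possible way to simulate dynamic client availability is a simple per-round join/drop model. The probabilities and update rule below are illustrative assumptions, not a standard from the literature:

```python
import random

def simulate_availability(n_clients, join_prob, drop_prob, rounds, seed=0):
    """Simulate per-round client participation: each round, online
    clients may drop out and offline clients may join. Returns the
    sorted list of online client IDs for every round."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    online = set(range(n_clients))
    history = []
    for _ in range(rounds):
        online = {c for c in range(n_clients)
                  if (c in online and rng.random() > drop_prob)
                  or (c not in online and rng.random() < join_prob)}
        history.append(sorted(online))
    return history

history = simulate_availability(n_clients=20, join_prob=0.2,
                                drop_prob=0.1, rounds=5)
```

Reporting such a scheme (with its seed and probabilities) is exactly the kind of disclosure that would let availability-dependent results be reproduced and compared across studies.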
The second element of a standardised benchmarking framework is the communication configuration of the FL setup, because it affects both model accuracy and system efficiency. The number of communication rounds and the client–server exchange size of each FL experiment should be recorded clearly. Communication rounds refer to the total number of local-to-global update iterations, and the convergence trend of models across training rounds is a key benchmarking metric [60]. The client–server exchange size is likewise important for the scalability of FL, as it impacts bandwidth and energy cost. The communication frequency and delay between clients and servers should also be clearly specified, especially if asynchronous setups are used, as these affect the latency and synchronisation of the entire FL framework. A good example is [60], where the use of asynchronous methods such as FedSam was clearly documented.
The third important element is the model-training protocol: the key training parameters, such as the number of local epochs per round, mini-batch size, optimiser, and learning rate, should be reported in detail, as these configurations directly influence global model performance. For example, study [61] recorded all the details of model training, including batch size, epochs, optimiser, learning rate, and model architecture. However, these details are often reported inconsistently across studies and must be standardised for proper comparison.
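As a sketch of such standardised reporting, the training protocol could be captured as a structured record that a benchmarking framework validates for completeness. The field names and values below are hypothetical illustrations, not a proposed standard:

```python
# Hypothetical, reportable training-protocol record: a benchmarking
# framework could require every FL experiment to state these fields
# explicitly so results can be compared. Values are placeholders.

training_protocol = {
    "local_epochs_per_round": 5,
    "batch_size": 32,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "model_architecture": "LSTM(64) -> Dense(2)",
    "communication_rounds": 100,
}

def validate_protocol(record):
    """Return the required fields missing from an experiment report."""
    required = {"local_epochs_per_round", "batch_size", "optimizer",
                "learning_rate", "model_architecture",
                "communication_rounds"}
    return sorted(required - set(record))

# A complete record passes; an incomplete one lists what must be added.
missing = validate_protocol({"batch_size": 32})
```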
Next, the FL experiment setup must evaluate and compare both standard and novel aggregation methods to identify their suitability across anomaly detection tasks. A benchmarking framework should clearly define and evaluate the chosen strategy. Examples include the FedSam aggregation method applied in [60], the DRL-assisted FL that combines FL with reinforcement learning to improve client selection and aggregation [62], and the standard and most commonly used FedAvg method applied in [61,63,64].
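To make the aggregation step concrete, the standard FedAvg rule referenced above can be sketched in a few lines: the global model is the sample-size-weighted average of the client models. The client parameter vectors and sample counts below are purely illustrative:

```python
# Minimal sketch of the FedAvg aggregation rule: each client's parameters
# are weighted by its local sample count. Values below are illustrative.

def fedavg(client_weights, client_sizes):
    """Aggregate per-client parameter vectors into a global model.

    client_weights: list of parameter vectors (lists of floats), one per client
    client_sizes:   list of local training sample counts, one per client
    """
    total = sum(client_sizes)
    global_weights = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += (size / total) * w
    return global_weights

# Two hypothetical clients: one with 100 samples, one with 300, so the
# second client contributes three times the weight of the first.
aggregated = fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300])
# 0.25*[1, 2] + 0.75*[3, 4] = [2.5, 3.5]
```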
Comparing the proposed FL solution with baseline approaches is the best way to demonstrate its value. However, baseline definitions and comparisons vary across experiments and lack standardisation. For fair evaluation, the suggested baseline approaches include centralised training, local-only models, and non-FL distributed models. The studies [60,64] incorporate this comparison framework, comparing their proposed solutions against centralised training models and non-FL models. Table 9 summarises the experiment setups of existing studies.
Table 9. Experiment setup of existing studies.

| Paper | Dataset | Data Type | Model Used | FL Method | FL Aggregation Algorithm |
|---|---|---|---|---|---|
| [60] | CIC-IDS2017, CIC-IDS2018, NCC-DC, MAWI-Lab | Non-IID, Multivariate | Autoencoder + Classifier | Supervised | FedSam (novel min–max + sampling) |
| [61] | ToN_IoT dataset | Heterogeneous IoT sensor data (Non-IID) | DNN, LSTM, GRU, FCN, LeNet | Supervised | Federated Averaging (FedAvg) |
| [62] | TON_IoT, Edge_IIoT, X-IIoTID | Non-IID, Multivariate Time Series | GRU (Gated Recurrent Units) + DRL | Supervised | FedAvg with DRL-assisted selection |
| [63] | EPS test jig dataset | Multivariate Time Series | Unsupervised Anomaly Detection (USAD) | Unsupervised | Federated Averaging (FedAvg) |
| [64] | CICDDoS2019, Bot-IoT, UNSW-NB15 | Network Traffic (IID) | Autoencoder | Unsupervised | Federated Averaging (FedAvg) |

4.6. Performance Metrics

To ensure a fair comparison of different experimental results, a well-defined combination of evaluation metrics is needed to measure the performance of a proposed solution. Traditional performance metrics such as accuracy, precision, recall, F1-Score, and false-positive rate (FPR) are commonly used to assess the performance of anomaly detection models. Accuracy measures the proportion of correctly classified instances (normal and anomalous). FPR calculates the percentage of normal instances that were wrongly flagged as anomalies. Precision measures the proportion of correctly identified anomalies among flagged instances. Recall measures the percentage of actual anomalies successfully detected. F1-Score balances precision and recall, offering a fair evaluation for imbalanced datasets. Apart from that, the Area Under the ROC Curve (AUROC) assesses the model’s discriminatory ability across various classification thresholds, while the Area Under the Precision–Recall Curve (AUPR) is better suited for imbalanced datasets, emphasising precision–recall trade-offs.
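The traditional metrics defined above can all be derived from the binary confusion matrix; a minimal sketch (labels are illustrative, with 1 denoting an anomaly):

```python
# Sketch of the traditional anomaly detection metrics discussed above,
# computed from confusion-matrix counts (1 = anomaly, 0 = normal).
# The example labels are illustrative only.

def detection_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # normal traffic wrongly flagged
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}

# Six illustrative instances: 2 TP, 2 TN, 1 FP, 1 FN.
m = detection_metrics([1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1])
```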
Beyond the traditional performance metrics, FL-specific performance indicators need to focus on evaluating the communication efficiency of the system [65]. The indicators for communication efficiency include the number of communication rounds required for the model to converge, the size of the model updates exchanged between clients and the central server, and the overall latency of the Federated Learning process. Researchers can utilise these performance indicators to achieve an optimal balance between scalability and accuracy in FL systems. In [66], the study used both traditional performance metrics (accuracy, AUC, and MSE) and FL-specific performance indicators such as training time, communication cost, and memory usage to evaluate the efficiency and scalability of FL frameworks in real-world applications. In [61], the loss function is used to measure model performance during training. The loss function measures the difference between the model’s predictions and the actual labels, where a lower loss indicates better performance. The loss function used is Categorical Cross-Entropy, a common metric for classification tasks like intrusion detection.
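As an illustration of the FL-specific indicators above, the per-round exchange size and cumulative communication cost can be estimated from the model's parameter count. The float32 assumption and the client and round counts below are illustrative, not drawn from the cited studies:

```python
# Hedged sketch of the FL-specific communication indicators: per-round
# payload size and cumulative bytes exchanged over training. Assumes
# float32 parameters (4 bytes each); all counts are illustrative.

def update_size_bytes(num_parameters, bytes_per_param=4):
    """Size of one full-model update, assuming float32 parameters."""
    return num_parameters * bytes_per_param

def total_communication_bytes(num_parameters, num_clients, num_rounds):
    """Each round, every client downloads and uploads one model copy."""
    per_round = 2 * num_clients * update_size_bytes(num_parameters)
    return per_round * num_rounds

# Hypothetical autoencoder with 100,000 parameters, 10 clients, 50 rounds:
cost = total_communication_bytes(100_000, num_clients=10, num_rounds=50)
# 2 * 10 * 400,000 bytes * 50 rounds = 400,000,000 bytes (~400 MB)
```

Even this rough estimate shows why update compression and selective transmission matter at scale: communication cost grows linearly in clients, rounds, and model size.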
Several novel evaluation metrics have also been proposed to enhance anomaly detection assessment in FL. In [67], Reputation and Trust Scores are introduced. The key components in this metric are Reputation (R), which refers to the historical measure of how consistently a local edge unit provides reliable model updates, and Trust (T), which determines whether a unit is trustworthy enough to participate in FL aggregation. A high R-score means that the unit’s updates are consistently close to the majority, while a low R-score indicates a high possibility of anomalies such as poisoning attacks. A T-score of 1 marks a reliable unit, while a T-score of 0 marks a malicious or unreliable unit. This trust calculation is robust against outliers and can be combined with various aggregation methods. In [68], the study utilised the False Detection Rate (FDR) to measure the proportion of false alarms (incorrectly flagged anomalies) among all detected anomalies. A lower FDR implies a more reliable detection method. This study also calculates an Intrusion Score (IS) based on the reconstruction error of normal traffic: a higher reconstruction error means that an attack is more likely. In summary, a good benchmarking framework should apply traditional performance metrics, FL-specific metrics, and novel evaluation methods to improve robustness and reliability in anomaly detection.
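The reputation-and-trust idea can be loosely sketched as follows. Note this is a simplified interpretation, not the exact formulation in [67]: here reputation is an exponential moving average of how close a client's update lies to the coordinate-wise median of all updates, and trust is a binary gate on reputation; all thresholds are illustrative.

```python
# Loose sketch in the spirit of the Reputation/Trust scores of [67]:
# reputation rewards updates close to the coordinate-wise median of all
# client updates (the "majority"), and trust gates aggregation membership.
# The exact formulas in [67] differ; constants here are illustrative.

def median(values):
    s = sorted(values)
    n, mid = len(values), len(values) // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def update_reputation(reputation, client_update, all_updates,
                      tolerance=1.0, decay=0.9):
    """Exponential moving reputation: reward updates near the majority."""
    medians = [median(col) for col in zip(*all_updates)]
    distance = sum(abs(u - m) for u, m in zip(client_update, medians))
    reward = 1.0 if distance <= tolerance else 0.0
    return decay * reputation + (1 - decay) * reward

def trust(reputation, threshold=0.5):
    """T-score: 1 = allowed to join aggregation, 0 = excluded."""
    return 1 if reputation >= threshold else 0

# Three hypothetical clients; the third's update is far from the majority.
updates = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0]]
r_good = update_reputation(0.8, updates[0], updates)  # rewarded, rises
r_bad = update_reputation(0.8, updates[2], updates)   # penalised, decays
```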
While a wide range of traditional, FL-specific, and novel performance metrics have been proposed and adopted, it is important to recognise that the relevance and priority of these metrics can vary significantly across different application domains. Among these, metrics such as precision, recall, F1-Score, AUROC, and false-positive rate (FPR) each highlight different aspects of model performance, and their importance depends largely on the operational risks associated with false alarms or missed detections. For instance, in safety-critical environments such as industrial automation or medical diagnostics, the consequences of false negatives, or failures to detect actual anomalies, can be catastrophic. In such cases, recall or false-negative rate (FNR) becomes especially critical. Conversely, in domains like spam filtering or credit-card fraud detection, false positives can cause unnecessary disruptions, and therefore, precision may be more relevant. Accordingly, benchmarking frameworks and evaluation strategies should be carefully aligned with the risk profile, tolerance thresholds, and domain-specific goals of the application, rather than relying on a fixed set of general-purpose metrics.

5. Discussion

The main goal of this systematic review is to explore the challenges associated with applying FL to anomaly detection in distributed server environments. Specifically, this systematic review addresses three main research questions: (RQ1) the impact of communication overhead on the scalability and real-time performance of FL-based anomaly detection; (RQ2) the adaptability of FL frameworks to dynamic server clusters; and (RQ3) the key components required for a standardised benchmarking framework for FL-based anomaly detection. Informed by a thorough examination of the reviewed studies, this section presents several insights and personal reflections on recurring patterns and research gaps.
As mentioned earlier, communication overhead and computational burden are significant bottlenecks to the scalability and real-time performance of FL-based anomaly detection. While it is true that the communication overhead in FL is influenced not only by data distribution but also by the structure and size of the model used, most of the studies discussed in this review focus on deep neural network (DNN)-based models such as LSTM, CNN, and autoencoders, rather than lightweight models like logistic regression or SVM. This is because the reviewed literature primarily addresses application domains such as IoT networks, network security, and cloud infrastructure monitoring. These domains are typically characterised by complex behavioural patterns, high-dimensional data, and dynamic, non-stationary environments. In such settings, simple models often fail to capture the temporal dependencies and intricate feature interactions required for effective anomaly detection. Although DNN-based models incur greater communication and computational costs, they offer stronger generalisation capabilities, enhanced robustness to data heterogeneity, and better adaptability to non-IID data distributions. Furthermore, in real-world deployments where data is frequently noisy, partially labelled, and continuously evolving, the accuracy and generalisability of anomaly detection become especially important. Under these conditions, deep learning models have proven not only to be more effective but, in many cases, essential. Therefore, the emphasis on complex model architectures in this review reflects the practical requirements and performance expectations of modern FL-based anomaly detection systems, rather than an arbitrary methodological preference. The previous section discussed many existing methods, such as asynchronous FL algorithms and model compression techniques, that have been introduced to mitigate the communication overhead problem.
However, each of these methods has its flaws. For instance, asynchronous FL can speed up convergence by allowing clients to update their models independently without waiting for slow participants. Nevertheless, this can lead to the issue of staleness, where updates are based on outdated local weights. Large staleness can slow down overall convergence and lead to greater error in the global model update [69]. This trade-off becomes particularly significant in scenarios with high client churn or intermittent connectivity, where synchronous updates are often infeasible or inefficient. Asynchronous FL allows such clients to participate flexibly, thereby improving resource utilisation and communication efficiency. However, the lack of synchronisation may exacerbate model staleness and reduce anomaly detection performance, especially when updates from unstable or unreliable clients dominate the aggregation process. Robust strategies that account for participation frequency and update quality are therefore essential for maintaining model stability in such environments. Research should focus on methods such as staleness-aware aggregation and adaptive learning rates to mitigate these issues and optimise asynchronous FL in the context of anomaly detection.
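A staleness-aware aggregation rule of the kind suggested above can be sketched as follows, assuming a polynomial decay in the style of FedAsync; the decay exponent and mixing constants are illustrative:

```python
# Hedged sketch of staleness-aware aggregation for asynchronous FL:
# an update computed against an old global model is down-weighted by the
# number of rounds it lagged, so stale clients perturb the global model
# less. The polynomial decay (1 + staleness)^(-alpha) is one common
# choice (FedAsync-style); the constants here are illustrative.

def staleness_weight(staleness, alpha=0.5):
    return (1 + staleness) ** (-alpha)

def apply_async_update(global_model, client_update, staleness,
                       base_mixing=0.5):
    """Mix a (possibly stale) client model into the global model."""
    eta = base_mixing * staleness_weight(staleness)
    return [(1 - eta) * g + eta * c
            for g, c in zip(global_model, client_update)]

# A fresh update (staleness 0) moves the model further than one that
# lagged three rounds behind.
fresh = apply_async_update([0.0, 0.0], [1.0, 1.0], staleness=0)
stale = apply_async_update([0.0, 0.0], [1.0, 1.0], staleness=3)
```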
In addition, model compression techniques can reduce communication costs effectively, but aggressive compression can degrade the convergence speed and final accuracy of the model, especially when applied to deep learning architectures with complex feature representations [70]. Another commonly used strategy is the selective update mechanism, which reduces communication by transmitting only significant local updates. However, this method also introduces notable limitations, especially in heterogeneous client environments. Clients with low activity or infrequent anomalies may rarely send updates, leading to under-representation in the aggregated model. This can result in biased learning and reduced generalisability. Moreover, the use of fixed thresholds to determine update significance may not generalise well across varying data distributions or anomaly types. Theoretical understanding of convergence behaviour and fairness under selective update regimes remains limited. While selective updates are effective in improving efficiency, their use in FL-based anomaly detection requires careful calibration, particularly in non-IID and imbalanced data scenarios. These insights suggest that future FL frameworks for anomaly detection must be adaptive, context-aware, and capable of balancing trade-offs between efficiency and accuracy to ensure robust and scalable deployment in real-world environments.
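A threshold-based selective update, as described above, can be sketched as follows; the fixed threshold is illustrative and, as noted, is exactly the design choice that risks under-representing low-activity clients:

```python
# Sketch of a threshold-based selective update mechanism: a client
# transmits only when its parameters have drifted sufficiently since the
# last transmitted model. The L1 drift measure and the fixed threshold
# are illustrative choices, not drawn from a specific cited study.

def should_transmit(current_weights, last_sent_weights, threshold=0.1):
    """L1 drift since the last transmitted update decides participation."""
    drift = sum(abs(c - l)
                for c, l in zip(current_weights, last_sent_weights))
    return drift > threshold

# A client with little local change stays silent, saving bandwidth...
quiet = should_transmit([1.00, 2.00], [1.01, 2.02])
# ...while a client that drifted substantially sends its update.
active = should_transmit([1.5, 2.6], [1.0, 2.0])
```

This also makes the fairness concern concrete: a client whose rare anomalies produce only small parameter drift may never cross the fixed threshold, and so never contributes to the global model.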
The exploration of adaptive strategies in FL presents significant opportunities for enhancing both efficiency and scalability, particularly in anomaly detection applications. Future research should prioritise the development of adaptive weight-aggregation algorithms that account for gradient staleness, ensuring timely and accurate model updates [71]. Additionally, integrating adaptive communication strategies—such as Adaptive FedAvg and FedBuff—can further reduce synchronisation overhead. These methods dynamically adjust communication frequency and client participation based on network conditions and client contributions, thereby improving overall system responsiveness and resource utilisation [72]. Moreover, the integration of edge computing with FL offers a promising hybrid architecture to alleviate server load and bandwidth constraints. By enabling edge devices to perform partial model training locally before synchronising with cloud servers, this method enhances scalability and reduces latency. Future studies should delve deeper into edge-FL integration, especially in the context of real-time anomaly detection, to develop more robust, efficient, and context-aware FL frameworks suitable for deployment in dynamic and resource-constrained environments.
The previous section discussed how the existing studies utilise dynamic clustering algorithms to group the clients with similar data characteristics in the dynamic server cluster. Although this method can address the problem of data heterogeneity, it can be computationally expensive, especially in a large-scale system. This is because the model needs to continuously recalculate distance metrics (e.g., divergence measures, cosine similarity) and cluster validity indices (e.g., Dunn index) in every training round to ensure the correct client allocation. In relation to RQ2 on dynamic server cluster adaptability, future research can consider exploring the hybrid approach that integrates selective federated updates [31], SVD-based parameter compression [30], CDFL [47], DCFL [39], and edge computing. In this hybrid approach, selective federated updates [31] can minimise unnecessary reclustering by triggering the reclustering only when the significant data distribution changes instead of performing reclustering in every training round. At the same time, SVD-based parameter compression [30] reduces the complexity of similarity calculations, enhancing the clustering efficiency. Then, CDFL [47] enables decentralised local aggregation at network edges before final server synchronisation. Meanwhile, DCFL [39] is applied to enhance clustering efficiency in resource-constrained environments by reducing frequent communication between the client and central server. Finally, edge computing distributes the computational workload equally to all the edge devices, reducing reliance on the central server. This combined approach not only enables dynamic clustering but also addresses the problem of communication overhead and data heterogeneity, improving the efficiency of the FL system in a large-scale and dynamic environment.
Drawing upon the reviewed literature, several Federated Learning algorithms and frameworks exhibit strong potential in enhancing anomaly detection capabilities across various application contexts. For example, FedBuff and Adaptive FedAvg [72] dynamically adjust communication intervals and client participation based on contribution levels and prevailing network conditions. These methods deliver notable improvements in responsiveness and scalability, which is particularly important in IoT and smart grid monitoring environments characterised by fluctuating network reliability. In decentralised settings, approaches such as Decentralised FL (DFL) [33,34] and CDFL [47] enable peer-to-peer aggregation or hierarchical edge learning, thereby reducing reliance on centralised servers and offering practical benefits for edge and fog computing scenarios. In cases involving substantial data heterogeneity or dynamic server clusters, methods including DCFL [39], FedAMP [48], and FedAC [44] incorporate adaptive client grouping or personalisation strategies. These mechanisms allow the learning framework to accommodate changing workloads, resource constraints, and variations in local performance. Such adaptability is especially relevant in domains such as cybersecurity and industrial control, where client behaviours are non-stationary and latency requirements are stringent. Although no single method can be considered universally superior, each of these frameworks offers distinct advantages when applied within specific operational contexts.
Concerning RQ3, which focuses on the components for a standardised benchmarking framework, data privacy protection and cross-domain adaptability are the points that need to be considered. To avoid raw data disclosure and ensure research reproducibility, synthetic data generation and feature statistic sharing (e.g., mean and variance) can be applied, as demonstrated by the IBM Cloud dataset [57] and Microsoft SmartNoise framework. Although this method enhances data privacy protection, it still suffers from weak generalisability. For instance, cloud server monitoring time-series data and network security event data are fundamentally different and require distinct model architectures. Future research can consider the integration of DCFL’s dynamic clustering mechanism [39] by embedding an adaptive module that supports domain-specific model switching into the FL framework, such as FedML’s API [73]. At the same time, hierarchical edge aggregation like CDFL [47] can be leveraged to enable resource-aware distributed training. This combination not only ensures privacy-preserving data simulation but also enhances flexible architecture adaptation of FL frameworks, thereby improving their reproducibility and scalability.
Beyond architectural design, a robust benchmarking framework should also account for three fundamental evaluation dimensions: privacy, latency, and detection efficacy. Privacy should be evaluated not only through the use of protective mechanisms such as differential privacy or secure aggregation, but also by reporting measurable indicators such as privacy budgets (for example, epsilon values) alongside their effects on model utility. Latency can be assessed using metrics such as communication cost per training round, average update time, or client-response delays, which are essential in deployment-sensitive scenarios. Detection efficacy, particularly in imbalanced anomaly detection tasks, should be measured through a broader set of indicators, including false-positive rates, time to detection, and class-specific precision, in addition to traditional metrics like F1-score and accuracy. Despite the importance of these dimensions, evaluations are frequently imbalanced or incomplete. Many studies focus primarily on detection accuracy while omitting details regarding privacy or latency. Additionally, inconsistencies in experimental configurations, including variations in client numbers, sampling frequencies, and data distributions, hinder meaningful comparison between studies. Some evaluations are conducted in idealised settings with stable connectivity and complete client participation, which fail to reflect the variability encountered in practical Federated Learning environments.
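The suggestion above to report privacy budgets alongside utility can be made concrete with the Laplace mechanism of differential privacy, where the noise scale is the query sensitivity divided by epsilon; the sensitivity and epsilon values below are illustrative:

```python
# Hedged sketch of the privacy-vs-utility reporting suggested above.
# For the Laplace mechanism of differential privacy, the noise scale is
# b = sensitivity / epsilon, so reporting epsilon alongside accuracy
# makes the trade-off explicit. Values below are illustrative only.

def laplace_scale(sensitivity, epsilon):
    """Noise scale b for the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

# A tighter privacy budget (smaller epsilon) forces larger noise, which
# is what degrades model utility and must therefore be reported together.
loose = laplace_scale(sensitivity=1.0, epsilon=1.0)
tight = laplace_scale(sensitivity=1.0, epsilon=0.1)
```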
An effective benchmarking strategy should ensure a balanced and transparent assessment across the three aforementioned dimensions. Methods designed to protect privacy, such as secure aggregation and differential privacy, often introduce computational and communication overhead, which may increase latency or impede convergence. Conversely, approaches intended to reduce latency, such as model compression or communication sparsification, can compromise anomaly detection performance. A well-designed framework must therefore incorporate comprehensive metrics that explicitly reflect these trade-offs. Such an approach would facilitate fair and reproducible evaluation and assist practitioners in selecting methods that align with specific operational constraints.

6. Limitations of the Study

This systematic review focuses on Federated Learning (FL), specifically its application in the field of anomaly detection. Hence, the scope of this study is limited, excluding discussions of FL applications in other fields such as resource monitoring. The challenges analysed in this review are also constrained to communication overhead (RQ1), dynamic server clustering (RQ2), and the key components needed to develop a standardised benchmarking framework (RQ3). In addition, the literature search was conducted with a limited set of predefined keywords, and only journal and conference papers published between 2020 and 2025 were considered. To obtain only the relevant papers, the collected papers were further filtered based on the predefined inclusion and exclusion criteria, which may have led to the omission of potentially relevant studies outside these parameters.
While this review aims to provide a comprehensive overview of FL for anomaly detection, it is important to acknowledge that the term “anomaly detection” spans a wide range of application domains, such as image processing, financial fraud detection, and cybersecurity—each with distinct data characteristics and model requirements. The comparative analyses and discussions in this paper primarily focus on distributed system environments, including IoT networks, cloud infrastructure, industrial systems, and network intrusion scenarios. These domains typically involve structured, time-series, or log-based data, which are well-suited to the FL strategies discussed. Therefore, caution should be taken when generalising the findings of this review to domains like computer vision or financial transaction monitoring, where data modalities, feature representations, and anomaly semantics may differ significantly.

7. Conclusions

This systematic review aims to conduct a comprehensive study of the existing solutions to the challenges encountered in Federated Learning-based anomaly detection. A total of 43 papers published between 2020 and 2025 were reviewed to answer the research questions. Based on this analysis, the key findings are summarised as follows. For RQ1, many studies identified communication overhead as the primary bottleneck affecting the scalability and real-time performance of FL-based anomaly detection. Existing research tends to use methods such as client-selection mechanisms, asynchronous update strategies, parameter compression techniques, and Decentralised Federated Learning (DFL) to optimise communication efficiency.
For RQ2, adapting FL to dynamic server clusters is essential to handle situations where clients join and exit continuously. However, data heterogeneity remains a fundamental challenge in achieving a robust and adaptive FL system. Clustering techniques such as ClusterGAN, HypCluster, Fed-RAC, and Clustered Federated Learning (CFL) are common solutions to improve adaptability. In addition, the leader–follower architecture and client-driven FL are also methods to balance performance and communication costs in adaptive FL. For RQ3, the lack of a standardised benchmarking framework is a significant challenge. A well-structured benchmarking framework should include a well-defined dataset, a consistent experiment setup, and standardised performance metrics. However, current research often employs varied experimental configurations, which hinders fair and meaningful comparisons across studies. Despite the increasing interest in FL-based anomaly detection, current benchmarking efforts remain fragmented and inconsistent. To establish a universally accepted evaluation framework, it is essential to incorporate not only detection efficacy but also factors such as communication latency, privacy guarantees, and deployment scalability. The decentralised characteristic of FL makes it a promising method for anomaly detection in a distributed environment. As for future research directions, several promising areas warrant further investigation. Future research should prioritise the development of comprehensive and standardised benchmarks that reflect the diverse operational constraints found in real-world distributed systems. Although advancements in communication efficiency and adaptive learning have mitigated the problems of communication overhead and data heterogeneity, the lack of a unified evaluation framework hinders fair comparison and real-world deployment.
Future research should focus on developing adaptive FL-based anomaly detection methods that minimise communication overhead while addressing the staleness problem. For example, adaptive staleness-aware aggregation, selective client participation, and hybrid edge-cloud frameworks can be explored to improve scalability and robustness. Apart from that, enhancing data privacy protection and improving cross-domain adaptability within FL-based benchmarking frameworks are critical next steps. Establishing standardised datasets and evaluation protocols is also important to ensure comparability across studies. In particular, future benchmarking frameworks should incorporate comprehensive evaluation of privacy, latency, and detection efficacy, as these dimensions capture the core trade-offs faced in practical deployments and are essential for ensuring fair and reproducible comparison across diverse application contexts. Addressing these challenges will be essential in advancing FL for scalable, adaptable, and real-world anomaly detection applications.

Author Contributions

Funding acquisition, L.-Y.O. and M.-C.L.; Investigation, L.-H.L.; Project administration, L.-Y.O.; Supervision, L.-Y.O.; Visualization, L.-H.L.; Writing—original draft, L.-H.L.; Writing—review and editing, L.-H.L., L.-Y.O. and M.-C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Telekom Malaysia Research & Development, RDTC/241125 (MMUE/240066).

Data Availability Statement

This study does not report any data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ivanovic, M. Influence of Federated Learning on Contemporary Research and Applications. In Proceedings of the 2024 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Craiova, Romania, 4–6 September 2024; pp. 1–6.
2. Li, Y.; Yan, Y.; Liu, Z.; Yin, C.; Zhang, J.; Zhang, Z. A federated learning method based on blockchain and cluster training. Electronics 2023, 12, 4014.
3. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 1–58.
4. Hosseinzadeh, M.; Rahmani, A.M.; Vo, B.; Bidaki, M.; Masdari, M.; Zangakani, M. Improving security using SVM-based anomaly detection: Issues and challenges. Soft Comput. 2021, 25, 3195–3223.
5. Lin, T.-H.; Jiang, J.-R. Anomaly Detection with Autoencoder and Random Forest. In Proceedings of the 2020 International Computer Symposium (ICS), Tainan, Taiwan, 17–19 December 2020; pp. 96–99.
6. Lee, T.-W.; Ong, L.-Y.; Leow, M.-C. Experimental Study using Unsupervised Anomaly Detection on Server Resources Monitoring. In Proceedings of the 2023 11th International Conference on Information and Communication Technology (ICoICT), Melaka, Malaysia, 23–24 August 2023; pp. 517–522.
7. Wang, N.; Yang, W.; Wang, X.; Wu, L.; Guan, Z.; Du, X.; Guizani, M. A blockchain based privacy-preserving federated learning scheme for Internet of Vehicles. Digit. Commun. Netw. 2024, 10, 126–134.
8. Ganguly, B.; Aggarwal, V. Online Federated Learning via Non-Stationary Detection and Adaptation Amidst Concept Drift. IEEE/ACM Trans. Netw. 2024, 32, 643–653.
9. Marfo, W.; Tosh, D.K.; Moore, S.V. Adaptive Client Selection in Federated Learning: A Network Anomaly Detection Use Case. In Proceedings of the 2025 International Conference on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 17–20 February 2025; pp. 601–605.
10. Drainakis, G.; Pantazopoulos, P.; Katsaros, K.V.; Sourlas, V.; Amditis, A.; Kaklamani, D.I. From centralized to Federated Learning: Exploring performance and end-to-end resource consumption. Comput. Netw. 2023, 225, 109657.
11. Zhang, H.; Ye, J.; Huang, W.; Liu, X.; Gu, J. Survey of federated learning in intrusion detection. J. Parallel Distrib. Comput. 2024, 195, 104976.
12. Fedorchenko, E.; Novikova, E.; Shulepov, A. Comparative review of the intrusion detection systems based on federated learning: Advantages and open challenges. Algorithms 2022, 15, 247.
13. Agrawal, S.; Sarkar, S.; Aouedi, O.; Yenduri, G.; Piamrat, K.; Alazab, M.; Bhattacharya, S.; Maddikunta, P.K.R.; Gadekallu, T.R. Federated learning for intrusion detection system: Concepts, challenges and future directions. Comput. Commun. 2022, 195, 346–361.
14. Siddiqi, S.; Qureshi, F.; Lindstaedt, S.; Kern, R. Detecting Outliers in Non-IID Data: A Systematic Literature Review. IEEE Access 2023, 11, 70333–70352.
15. Berkani, M.R.A.; Chouchane, A.; Himeur, Y.; Ouamane, A.; Miniaoui, S.; Atalla, S.; Mansoor, W.; Al-Ahmad, H. Advances in Federated Learning: Applications and Challenges in Smart Building Environments and Beyond. Computers 2025, 14, 124.
16. Wang, Y.; Zobiri, F.; Mustafa, M.A.; Nightingale, J.; Deconinck, G. Consumption prediction with privacy concern: Application and evaluation of Federated Learning. Sustain. Energy Grids Netw. 2024, 38, 101248.
17. Reddy, D.T.; Nandigam, H.; Indla, S.C.; Raja, S.P. Federated Learning in Data Privacy and Security. Adv. Distrib. Comput. Artif. Intell. J. 2024, 13, 21.
18. Korkmaz, A.; Rao, P. A Selective Homomorphic Encryption Approach for Faster Privacy-Preserving Federated Learning. arXiv 2025, arXiv:2501.12911.
19. Cavallin, F.; Mayer, R. Anomaly detection from distributed data sources via federated learning. In International Conference on Advanced Information Networking and Applications; Springer International Publishing: Cham, Switzerland, 2022; pp. 317–328.
20. Olanrewaju-George, B.; Pranggono, B. Federated learning-based intrusion detection system for the internet of things using unsupervised and supervised deep learning models. Cyber Secur. Appl. 2025, 3, 100068.
21. Nardi, M.; Valerio, L.; Passarella, A. Anomaly Detection Through Unsupervised Federated Learning. In Proceedings of the 2022 18th International Conference on Mobility, Sensing and Networking (MSN), Guangzhou, China, 14–16 December 2022; pp. 495–501.
22. Li, Y.; Li, Y. Semi-supervised federated learning for collaborative security threat detection in control system for distributed power generation. Eng. Appl. Artif. Intell. 2025, 148, 110374.
23. Quyen, N.H.; Duy, P.T.; Nguyen, N.T.; Khoa, N.H.; Pham, V.H. FedKD-IDS: A robust intrusion detection system using knowledge distillation-based semi-supervised federated learning and anti-poisoning attack mechanism. Inf. Fusion 2025, 117, 102807.
24. Nguyen, V.T.; Beuran, R. Fedmse: Semi-Supervised Federated Learning Approach for IoT Network Intrusion Detection. Comput. Secur. 2025, 151, 104337.
25. Tham, C.-K.; Yang, L.; Khanna, A.; Gera, B. Federated Learning for Anomaly Detection in Vehicular Networks. In Proceedings of the 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring), Florence, Italy, 20–23 June 2023; pp. 1–6.
26. Hao, J.; Chen, P.; Chen, J.; Li, X. Effectively detecting and diagnosing distributed multivariate time series anomalies via Unsupervised Federated Hypernetwork. Inf. Process. Manag. 2025, 62, 104107.
27. Shrestha, R.; Mohammadi, M.; Sinaei, S.; Salcines, A.; Pampliega, D.; Clemente, R.; Lindgren, A. Anomaly detection based on LSTM and autoencoders using federated learning in smart electric grid. J. Parallel Distrib. Comput. 2024, 193, 104951.
28. Liu, F.; Zeng, C.; Zhang, L.; Zhou, Y.; Mu, Q.; Zhang, Y.; Zhang, L.; Zhu, C. FedTADBench: Federated Time-series Anomaly Detection Benchmark. In Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Hainan, China, 18–20 December 2022; pp. 303–310.
29. Zhai, W.; Wang, F.; Liu, L.; Ding, Y.; Lu, W. Federated Semi-Supervised and Semi-Asynchronous Learning for Anomaly Detection in IoT Networks. arXiv 2023, arXiv:2308.11981.
30. Cui, X.; Han, X.; Liu, G.; Zuo, W.; Wang, Z. Communication-Efficient Federated Learning for Network Traffic Anomaly Detection. In Proceedings of the 2023 19th International Conference on Mobility, Sensing and Networking (MSN), Nanjing, China, 14–16 December 2023; pp. 398–405.
  31. Alkulaib, L. Adaptive Hierarchical GHSOM with Federated Learning for Context-Aware Anomaly Detection in IoT Networks. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 5917–5925. [Google Scholar] [CrossRef]
  32. Raeiszadeh, M.; Ebrahimzadeh, A.; Glitho, R.H.; Eker, J.; Mini, R.A.F. Asynchronous Real-Time Federated Learning for Anomaly Detection in Microservice Cloud Applications. IEEE Trans. Mach. Learn. Commun. Netw. 2025, 3, 176–194. [Google Scholar] [CrossRef]
  33. Gómez, Á.L.P.; Beltrán, E.T.M.; Sánchez, P.M.S.; Celdrán, A.H. TemporalFED: Detecting cyberattacks in industrial time-series data using decentralized federated learning. arXiv 2023, arXiv:2308.03554. [Google Scholar]
  34. Ma, T.; Wang, H.; Li, C. Quantized Distributed Federated Learning for Industrial Internet of Things. IEEE Internet Things J. 2023, 10, 3027–3036. [Google Scholar] [CrossRef]
  35. Qiu, W.; Ai, W.; Chen, H.; Feng, Q.; Tang, G. Decentralized Federated Learning for Industrial IoT With Deep Echo State Networks. IEEE Trans. Ind. Inform. 2023, 19, 5849–5857. [Google Scholar] [CrossRef]
  36. Ma, Q.; Xu, Y.; Xu, H.; Jiang, Z.; Huang, L.; Huang, H. FedSA: A Semi-Asynchronous Federated Learning Mechanism in Heterogeneous Edge Computing. IEEE J. Sel. Areas Commun. 2021, 39, 3654–3672. [Google Scholar] [CrossRef]
  37. Kim, Y.; Hakim, E.A.; Haraldson, J.; Eriksson, H.; da Silva, J.M.B.; Fischione, C. Dynamic Clustering in Federated Learning. In Proceedings of the ICC 2021—IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar] [CrossRef]
  38. Liberti, F.; Berardi, D.; Martini, B. Federated Learning in Dynamic and Heterogeneous Environments: Advantages, Performances, and Privacy Problems. Appl. Sci. 2024, 14, 8490. [Google Scholar] [CrossRef]
  39. Augello, A.; Falzone, G.; Re, G.L. DCFL: Dynamic Clustered Federated Learning under Differential Privacy Settings. In Proceedings of the 2023 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops), Atlanta, GA, USA, 13–17 March 2023; pp. 614–619. [Google Scholar] [CrossRef]
  40. Mishra, R.; Gupta, H.P.; Banga, G.; Das, S.K. Fed-RAC: Resource-Aware Clustering for Tackling Heterogeneity of Participants in Federated Learning. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 1207–1220. [Google Scholar] [CrossRef]
  41. Zhao, H.; Sun, X.; Dong, J.; Chen, C.; Dong, Z. Highlight Every Step: Knowledge Distillation via Collaborative Teaching. IEEE Trans. Cybern. 2022, 52, 2070–2081. [Google Scholar] [CrossRef] [PubMed]
  42. Du, R.; Xu, S.; Zhang, R.; Xu, L.; Xia, H. A dynamic adaptive iterative clustered federated learning scheme. Knowl.-Based Syst. 2023, 276, 110741. [Google Scholar] [CrossRef]
  43. Sattler, F.; Müller, K.-R.; Samek, W. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3710–3722. [Google Scholar] [CrossRef] [PubMed]
  44. Zhang, Y.; Chen, H.; Lin, Z.; Chen, Z.; Zhao, J. FedAC: An Adaptive Clustered Federated Learning Framework for Heterogeneous Data. arXiv 2024, arXiv:2403.16460. [Google Scholar]
  45. Duan, M.; Liu, D.; Ji, X.; Liu, R.; Liang, L.; Chen, X.; Tan, Y. FedGroup: Efficient Federated Learning via Decomposed Similarity-Based Clustering. In Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, 30 September–3 October 2021; pp. 228–237. [Google Scholar] [CrossRef]
  46. Long, G.; Xie, M.; Shen, T.; Zhou, T.; Wang, X.; Jiang, J. Multi-center federated learning: Clients clustering for better personalization. World Wide Web 2023, 26, 481–500. [Google Scholar] [CrossRef]
  47. Li, S.; Zhu, C. Towards client driven federated learning. arXiv 2024, arXiv:2405.15407. [Google Scholar] [CrossRef]
  48. Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three approaches for personalization with applications to federated learning. arXiv 2020, arXiv:2002.10619. [Google Scholar] [CrossRef]
  49. Ruan, Y.; Joe-Wong, C. Fedsoft: Soft clustered federated learning with proximal local updating. Proc. AAAI Conf. Artif. Intell. 2022, 36, 8124–8131. [Google Scholar] [CrossRef]
  50. Khan, A.F.; Wang, X.; Le, Q.; Khan, A.A.; Ali, H.; Jin, M.; Ding, J.; Butt, A.R.; Anwar, A. IP-FL: Incentivized and Personalized Federated Learning. arXiv 2023, arXiv:2304.07514. [Google Scholar] [CrossRef]
  51. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An Efficient Framework for Clustered Federated Learning. IEEE Trans. Inf. Theory 2022, 68, 8076–8091. [Google Scholar] [CrossRef]
  52. Ali, S.S.; Kumar, A.; Ali, M.; Singh, A.K.; Choi, B.J. Temporal Adaptive Clustering for Heterogeneous Clients in Federated Learning. In Proceedings of the 2024 International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 17–19 January 2024; pp. 11–16. [Google Scholar] [CrossRef]
  53. Ni, Z.; Hashemi, M. Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach. In Proceedings of the 2023 IEEE Virtual Conference on Communications (VCC), NY, USA, 28–30 November 2023; pp. 115–120. [Google Scholar] [CrossRef]
  54. Larsson, H.; Riaz, H.; Ickin, S. Automated collaborator selection for federated learning with multi-armed bandit agents. In Proceedings of the 4th FlexNets Workshop on Flexible Networks Artificial Intelligence Supported Network Flexibility and Agility, Virtual Event, USA, 23 August 2021; pp. 44–49. [Google Scholar] [CrossRef]
  55. Xia, W.; Quek, T.Q.S.; Guo, K.; Wen, W.; Yang, H.H.; Zhu, H. Multi-Armed Bandit-Based Client Scheduling for Federated Learning. IEEE Trans. Wirel. Commun. 2020, 19, 7108–7123. [Google Scholar] [CrossRef]
  56. Anwar, A.; Moser, B.; Herurkar, D.; Raue, F.; Hegiste, V.; Legler, T.; Dengel, A. FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data. In Proceedings of the 2024 2nd International Conference on Federated Learning Technologies and Applications (FLTA), Valencia, Spain, 17–20 September 2024; pp. 115–122. [Google Scholar] [CrossRef]
  57. Islam, M.S.; Rakha, M.S.; Pourmajidi; Sivaloganathan, J.; Steinbacher, J.; Miranskyy, A. Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset. arXiv 2024, arXiv:2411.09047. [Google Scholar]
  58. Dehlaghi-Ghadim, A.; Markovic, T.; Leon, M.; Söderman, D.; Strandberg, P.E. Federated Learning for Network Anomaly Detection in a Distributed Industrial Environment. In Proceedings of the 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 15–17 December 2023; pp. 218–225. [Google Scholar] [CrossRef]
  59. Alhammadi, R.; Gawanmeh, A.; Atalla, S.; Alkhatib, M.Q.; Mansoor, W. Performance Evaluation of Federated Learning for Anomaly Network Detection. In Proceedings of the 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, 24–27 July 2023; pp. 116–122. [Google Scholar] [CrossRef]
  60. Vucovich, M.; Tarcar, A.; Rebelo, P.; Rahman, A.; Nandakumar, D.; Redino, C.; Choi, K.; Schiller, R.; Bhattacharya, S.; Veeramani, B.; et al. Anomaly Detection via Federated Learning. In Proceedings of the 2023 33rd International Telecommunication Networks and Applications Conference, Melbourne, Australia, 29 November–1 December 2023; pp. 259–266. [Google Scholar] [CrossRef]
  61. Mahmud, S.A.; Islam, N.; Islam, Z.; Rahman, Z.; Mehedi, S.T. Privacy-Preserving Federated Learning-Based Intrusion Detection Technique for Cyber-Physical Systems. Mathematics 2024, 12, 3194. [Google Scholar] [CrossRef]
  62. Kaur, A. Intrusion Detection Approach for Industrial Internet of Things Traffic Using Deep Recurrent Reinforcement Learning Assisted Federated Learning. IEEE Trans. Artif. Intell. 2025, 6, 37–50. [Google Scholar] [CrossRef]
  63. Kea, K.; Han, Y.; Min, Y.-J. A Federated Learning Approach for Efficient Anomaly Detection in Electric Power Steering Systems. IEEE Access 2024, 12, 67525–67536. [Google Scholar] [CrossRef]
  64. Tayeen, A.S.M.; Misra, S.; Cao, H.; Harikumar, J. CAFNet: Compressed Autoencoder-based Federated Network for Anomaly Detection. In Proceedings of the MILCOM 2023–2023 IEEE Military Communications Conference (MILCOM), Boston, MA, USA, 30 October–3 November 2023; pp. 325–330. [Google Scholar] [CrossRef]
  65. Albshaier, L.; Almarri, S.; Albuali, A. Federated Learning for Cloud and Edge Security: A Systematic Review of Challenges and AI Opportunities. Electronics 2025, 14, 1019. [Google Scholar] [CrossRef]
  66. Liu, X.; Shi, T.; Xie, C.; Li, Q.; Hu, K.; Kim, H.; Xu, X.; Vu-Le, T.A.; Huang, Z.; Nourian, A.; et al. UniFed: All-in-One Federated Learning Platform to Unify Open-Source Frameworks. arXiv 2022, arXiv:2207.10308. [Google Scholar]
  67. Zatsarenko, R.; Chuprov, S.; Korobeinikov, D.; Reznik, L. Trust-Based Anomaly Detection in Federated Edge Learning. In Proceedings of the 2024 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 29–31 May 2024; pp. 273–279. [Google Scholar] [CrossRef]
  68. Idrissi, M.J.; Alami, H.; El Mahdaouy, A.; El Mekki, A.; Oualil, S.; Yartaoui, Z.; Berrada, I. Fed-ANIDS: Federated learning for anomaly-based network intrusion detection systems. Expert Syst. Appl. 2023, 234, 121000. [Google Scholar] [CrossRef]
  69. Xie, C.; Koyejo, S.; Gupta, I. Asynchronous federated optimization. arXiv 2019, arXiv:1903.03934. [Google Scholar]
  70. Sattler, F.; Wiedemann, S.; Müller, K.-R.; Samek, W. Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3400–3413. [Google Scholar] [CrossRef]
  71. Wang, Q.; Yang, Q.; He, S.; Shi, Z.; Chen, J. AsyncFedED: Asynchronous federated learning with Euclidean distance based adaptive weight aggregation. arXiv 2022, arXiv:2205.13797. [Google Scholar]
  72. Shi, C.; Zhao, H.; Zhang, B.; Zhou, M.; Guo, D.; Chang, Y. FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors. arXiv 2025, arXiv:2503.15842. [Google Scholar]
  73. He, C.; Li, S.; So, J.; Zeng, X.; Zhang, M.; Wang, H.; Avestimehr, S. FedML: A research library and benchmark for federated machine learning. arXiv 2023, arXiv:2007.13518. [Google Scholar]
Figure 1. Research methodology.
Figure 2. Number of papers based on publication year.
Figure 3. Number of papers based on publication type.
Table 1. Studies that applied Federated Learning for anomaly detection.
Paper Title | Problem Addressed | Detection Method | Key Findings | Performance
Anomaly Detection from Distributed Data Sources via Federated Learning [19]
FL-based anomaly detection across decentralised data sources while preserving data privacy.
Unsupervised [Isolation Forest (IF)]
Semi-supervised [Gaussian Mixture Models (GMM)]
Supervised [MLP]
The experiment results show that the supervised model (MLP) achieves the best performance in every scenario (centralised, FL with 15 clients, FL with 30 clients), while the unsupervised model (IF) has the worst performance.

MLP performs best in the centralised setting, with a slight drop in FL with 15 clients and a significant decline in FL with 30 clients. This shows that FL performance can degrade once the number of clients grows beyond a certain point, so the scalability of FL remains a challenge.

IF fails to detect anomalies in all scenarios, especially in FL with 30 clients. The semi-supervised model (GMM) shows high variability and unpredictable performance; nevertheless, GMM is more stable than IF.
Credit-Card Fraud Dataset

Centralised
MLP
Precision: 87.5%
Recall: 85.7%
F2: 86.1%
GMM
Precision: 36.6%
Recall: 39.8%
F2: 39.1%
IF
Precision: 13.3%
Recall: 4.1%
F2: 4.7%

FL-15 Clients
MLP
Precision: 85.4%
Recall: 71.4%
F2: 73.8%
GMM
Precision: 40.7%
Recall: 52.4%
F2: 48.6%
IF
Precision: 0.9%
Recall: 16.3%
F2: 3.8%

FL-30 Clients
MLP
Precision: 78.6%
Recall: 44.9%
F2: 49.1%
GMM
Precision: 78.6%
Recall: 44.9%
F2: 49.1%
IF
Precision: 0.0%
Recall: 0.0%
F2: 0.0%
Federated Learning for Anomaly Detection in Vehicular Networks [25]
Intrusion and anomaly detection in Internet of Vehicles (IoV) using FL.
Supervised
[LSTM]
[CNN-LSTM]
This experiment applies a supervised FL method for anomaly detection in vehicular networks. The FL system is built on the Federated Averaging (FedAvg) aggregation method, and FedAvg with the Adam optimiser performs better than the FedProx algorithm.

This study compares the performance of LSTM and hybrid CNN-LSTM models under IID (Random) and non-IID (Quad) conditions. The experiment results show that in the IID (Random) setting, where each client has the same data type and data distribution, CNN-LSTM has better overall performance than LSTM. However, in the non-IID (Quad) setting, where data type and data distribution vary across clients, LSTM performs better than CNN-LSTM. Both models perform better in IID scenarios than in non-IID settings.
FedAvg-Adam (Random)
LSTM
Accuracy: 95.54%
Precision: 92.79%
Recall: 99.55%
F1: 96.05%
CNN-LSTM
Accuracy: 95.60%
Precision: 92.96%
Recall: 99.44%
F1: 96.09%

FedAvg-Adam (Quad)
LSTM
Accuracy: 95.10%
Precision: 92.51%
Recall: 99.03%
F1: 95.66%
CNN-LSTM
Accuracy: 94.38%
Precision: 92.17%
Recall: 98.02%
F1: 95.00%
Federated learning-based intrusion detection system for the internet of things using unsupervised and supervised deep learning models [20]
Scalable intrusion detection in heterogeneous IoT networks using FL.
Unsupervised [Autoencoder (AE)]

Supervised
[Deep Neural Network (DNN)]
This study compares supervised and unsupervised FL methods for anomaly detection in IoT devices. The supervised model is a DNN, while the unsupervised model is an autoencoder (AE). The experiment uses accuracy, precision, recall, F1-score, true-positive rate (TPR), and false-positive rate (FPR) as performance metrics.

The results show that the FL DNN achieves the same performance as the non-FL DNN across all metrics, except that the non-FL DNN has a higher FPR than the FL DNN. A similar trend is seen in the AE model, where the FL AE outperforms the non-FL AE in FPR. Comparing the FL DNN and FL AE, the unsupervised FL AE performs better than the supervised FL DNN, with a lower FPR.
Anomaly detection based on LSTM and autoencoders using federated learning in smart electric grid [27]
Federated deep learning for smart grid anomaly detection.
Unsupervised
[LSTM + Autoencoder (AE)]
This study applies an unsupervised LSTM-Autoencoder model for anomaly detection in a smart electric grid system. The experiment compares two anomaly detection standards: Mean Standard Deviation (MSD) and Median Absolute Deviation (MAD). Homomorphic encryption (HE) is applied to prevent sensitive data exposure. The models are tested with different threshold values (K); all models perform best at K = 5 with a 128-bit HE key. The final result shows that MSD performs better than MAD across all performance metrics.
Mean Standard Deviation (MSD)
K = 5 HE = 128 bits
Accuracy: 98%
Precision: 97%
Recall: 98%
F1: 97%

Median Absolute Deviation (MAD)
K = 5 HE = 128 bits
Accuracy: 79%
Precision: 85%
Recall: 79%
F1: 78%
Anomaly Detection through Unsupervised Federated Learning [21]
Unsupervised anomaly detection on decentralised, non-IID edge data via FL.
Unsupervised
[Autoencoder (AE)]
This experiment proposes an FL method that uses community detection to improve the accuracy of anomaly detection. First, an OC-SVM model is used to group the clients based on their data characteristics. After the clients are grouped into distinct communities, each community collaborates to train an autoencoder model under federated settings for anomaly detection. This study shows the feasibility of community-based FL, where clients can be grouped into clusters or communities based on their characteristics.
Effectively detecting and diagnosing distributed multivariate time series anomalies via Unsupervised Federated Hypernetwork [26]
Federated anomaly detection and localisation for distributed multivariate time series.
Unsupervised
[self-proposed uFedHy-DisMTSADD model]
The proposed Unsupervised Federated Hypernetwork method for Distributed Multivariate Time-Series Anomaly Detection and Diagnosis (uFedHy-DisMTSADD) allows collaborative model training while ensuring the data privacy of each client. The core component is the Federated Hypernetwork architecture, which effectively addresses data heterogeneity and fluctuations in distributed environments. The model integrates a Series-Conversion Normalisation Transformer (SC Nor-Transformer), which improves anomaly detection accuracy by enhancing the temporal dependence of subsequences. The SC Nor-Transformer can also handle timing biases that arise during model aggregation, which boosts the robustness of the FL system.
Compared to baseline models, the proposed model achieves an average F1-score increase of 9.19% and an average AUROC increase of 2.41%.
Semi-supervised federated learning for collaborative security threat detection in control system for distributed power generation [22]
Semi-supervised intrusion detection in distributed power systems under privacy constraints.
Semi-supervised
[Federated Uncertainty-aware Pseudo-label Selection (FedUPS)]
The proposed Federated Uncertainty-aware Pseudo-label Selection (FedUPS) framework combines semi-supervised learning and Federated Learning for anomaly detection in security systems. Convolutional Neural Networks (CNNs) are utilised to identify security threats in distributed power-generation systems. By integrating the Uncertainty-aware Pseudo-label Selection (UPS) module, the model can effectively handle unlabelled data while ensuring the credibility of pseudo-labels. The framework not only enhances the accuracy of the anomaly detection system but also preserves data privacy. However, FedUPS has a limitation: communication and computational overhead rise as the number of clients increases, which may limit the scalability of the model in real-world applications.
FedUPS
Accuracy: 86.47%
Precision: 91.12%
Recall: 89.86%
F1: 86.57%
FedKD-IDS: A robust intrusion detection system using knowledge distillation-based semi-supervised federated learning and anti-poisoning attack mechanism [23]
Robust FL-based IoT intrusion detection tackling non-IID data, limited labels, and collaborative poisoning threats.
Semi-supervised
[FedKD-IDS]
The proposed FedKD-IDS framework combines semi-supervised Federated Learning (SSFL) with knowledge distillation (KD) for an intrusion detection system (IDS). The method can handle both labelled and unlabelled data, allowing it to learn from diverse datasets without extensive manual labelling. The KD mechanism lets the FL-based models share logits instead of model weights, which significantly reduces communication overhead and enhances data privacy.
Malicious Collaborator Rate: 50%

FedKD-IDS
Accuracy: 79.09%
Precision: 79.09%
Recall: 73.14%
F1: 75.55%

SSFL
Accuracy: 19.86%
Precision: 19.51%
Recall: 15.70%
F1: 17.40%
FedMSE: Semi-supervised federated learning approach for IoT network intrusion detection [24]
Secure IoT intrusion detection under heterogeneity via SAE-CEN and MSE-aware aggregation.
Semi-supervised
The proposed FedMSE framework utilises a hybrid model that combines a Shrink Autoencoder (SAE) and a centroid one-class classifier (CEN) for IoT network intrusion detection. The SAE is leveraged for feature representation, compressing high-dimensional network data into a lower-dimensional latent space, while the CEN identifies anomalies by measuring the distance of data points from the centroid of normal data in that space. The framework applies semi-supervised learning: the model is trained on normal data, and data that deviates from it is categorised as anomalous. For the global model update, the MSEAvg aggregation method assigns weights to local models based on their reconstruction error (MSE), giving models with lower MSE higher priority in the aggregation. This effectively improves global model accuracy because the global model is less influenced by poorly performing clients. The experiment results show that FedMSE achieves better performance than the same model aggregated with FedAvg or FedProx. The limitation of the proposed solution is the computational overhead of MSEAvg, since the server must calculate the MSE for every client in each global aggregation round.
SAE-CEN (accuracy)
FedAvg: 96.93 ± 0.70
FedProx: 97.28 ± 0.84
FedMSE: 97.3 ± 0.49
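The MSE-weighted aggregation described for FedMSE [24] can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the inverse-MSE normalisation used to turn reconstruction errors into aggregation weights is an assumed scheme, and `mse_weighted_aggregate` is a hypothetical helper name.

```python
import numpy as np

def mse_weighted_aggregate(client_weights, client_mses):
    """Aggregate client model parameters, giving clients with lower
    reconstruction MSE a larger share of the global update.

    client_weights: list of 1-D parameter vectors (one per client)
    client_mses:    list of reconstruction errors (one per client)
    """
    # Inverse-MSE scores: lower error -> higher priority (assumed scheme).
    scores = 1.0 / (np.asarray(client_mses, dtype=float) + 1e-12)
    coeffs = scores / scores.sum()          # normalise weights to sum to 1
    stacked = np.stack(client_weights)      # shape: (n_clients, n_params)
    return coeffs @ stacked                 # convex combination of models

# Example: three clients; the low-MSE client dominates the global model.
params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
mses = [0.01, 0.5, 0.25]
global_params = mse_weighted_aggregate(params, mses)
```

Because the coefficients form a convex combination, a single client with a very low MSE pulls the global model strongly towards its local parameters, which mirrors the prioritisation behaviour described for MSEAvg.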
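The two anomaly-thresholding standards compared in [27] can likewise be illustrated on reconstruction errors. The exact formulas below (mean + K·std for MSD, median + K·MAD for MAD) are common formulations and an assumption about that paper's implementation; the error values are synthetic.

```python
import numpy as np

def msd_threshold(errors, k=5):
    # Mean Standard Deviation rule: flag errors above mean + k * std.
    e = np.asarray(errors, dtype=float)
    return e.mean() + k * e.std()

def mad_threshold(errors, k=5):
    # Median Absolute Deviation rule: flag errors above median + k * MAD.
    e = np.asarray(errors, dtype=float)
    med = np.median(e)
    return med + k * np.median(np.abs(e - med))

# Reconstruction errors: mostly normal readings plus two injected anomalies.
rng = np.random.default_rng(0)
errors = np.concatenate([rng.normal(0.1, 0.02, 500), [0.9, 1.1]])
msd_flags = errors > msd_threshold(errors, k=5)
mad_flags = errors > mad_threshold(errors, k=5)
```

Note that the MAD threshold sits lower than the MSD one here, because the median and MAD are barely inflated by the two outliers, whereas the mean and standard deviation are; this robustness difference is one reason the two standards can yield different detection results.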
Table 2. Search keywords and Boolean search string.
Keyword | Boolean Search String
Federated Learning | “Federated Learning” OR “FL” OR “Decentralized Learning” OR “Distributed Federated Learning”
Anomaly Detection | “Anomaly Detection” OR “Intrusion Detection” OR “Outlier Detection” OR “Fault Detection”
Scalability | “Scalability” OR “Elasticity” OR “Load Adaptability”
Real-time Performance | “Real-time performance” OR “Low Latency” OR “Response Time”
Dynamic Server Clusters | “Dynamic Server Clusters” OR “Adaptive Server Network” OR “Scalable Nodes” OR “Dynamic Infrastructure”
Benchmarking Framework | “Benchmarking framework” OR “Performance Evaluation” OR “Evaluation Metrics” OR “Standardized Assessment” OR “FL Benchmark” OR “Anomaly Detection Metrics”
Table 3. Inclusion and exclusion criteria.
Inclusion | Criteria | Exclusion
Journal and conference | Publication Type | White paper, thesis, dissertation
2020–2025 | Publication Year | Before 2020
English | Publication Language | Non-English
Related | Related to RQs | Non-Related
Table 4. Number of papers in each round of filtering.
Source | Round 1 (All) | Round 2 (RQ1, RQ2, RQ3) | Round 3 (RQ1, RQ2, RQ3)
Scopus | 323 | 57, 86, 50 | 6, 14, 11
IEEE | 723 | 29, 86, 47 | 3, 1, 0
ArXiv | 86 | 9, 12, 9 | 2, 4, 2
Total | 1132 | 95, 184, 106 | 11, 19, 13
Total per round | 1132 | 385 | 43
Share and Cite

Lim, L.-H.; Ong, L.-Y.; Leow, M.-C. Federated Learning for Anomaly Detection: A Systematic Review on Scalability, Adaptability, and Benchmarking Framework. Future Internet 2025, 17, 375. https://doi.org/10.3390/fi17080375
