Article

Detecting Emerging DGA Malware in Federated Environments via Variational Autoencoder-Based Clustering and Resource-Aware Client Selection

by Ma Viet Duc, Pham Minh Dang, Tran Thu Phuong, Truong Duc Truong, Vu Hai and Nguyen Huu Thanh *
School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(7), 299; https://doi.org/10.3390/fi17070299
Submission received: 24 May 2025 / Revised: 27 June 2025 / Accepted: 27 June 2025 / Published: 3 July 2025
(This article belongs to the Special Issue Security of Computer System and Network)

Abstract

Domain Generation Algorithms (DGAs) remain a persistent technique used by modern malware to establish stealthy command-and-control (C&C) channels, thereby evading traditional blacklist-based defenses. Detecting such evolving threats is especially challenging in decentralized environments where raw traffic data cannot be aggregated due to privacy or policy constraints. To address this, we present FedSAGE, a security-aware federated intrusion detection framework that combines Variational Autoencoder (VAE)-based latent representation learning with unsupervised clustering and resource-efficient client selection. Each client encodes its local domain traffic into a semantic latent space using a shared VAE pre-trained solely on benign domains. These embeddings are clustered via affinity propagation to group clients with similar data distributions and identify outliers indicative of novel threats, without requiring any labeled DGA samples. Within each cluster, FedSAGE selects only the fastest clients for training, balancing computational constraints with threat visibility. Experimental results on the multi-zone DGA dataset show that FedSAGE improves detection accuracy by up to 11.6% and reduces energy consumption by up to 93.8% compared to standard FedAvg under non-IID conditions. Notably, the latent clustering perfectly recovers ground-truth DGA family zones, enabling effective anomaly detection in a fully unsupervised manner while remaining privacy-preserving. These findings demonstrate that FedSAGE is a practical and lightweight approach for decentralized detection of evasive malware, offering a viable solution for secure and adaptive defense in resource-constrained edge environments.

1. Introduction

Intrusion Detection Systems (IDS) are important components in modern cybersecurity infrastructures, designed to monitor network traffic and detect attacks, unauthorized access, or abnormal behavior [1]. One of the emerging threats to Internet security is the use of Domain Generation Algorithms (DGA), which enable malware to generate large numbers of pseudo-random domain names to establish command-and-control (C&C) communication [2,3,4]. More specifically, a benign domain might look like google.com, whereas a DGA-generated domain could be agsdjasa98.org, containing seemingly random characters and lacking the familiar grammatical structure that humans are accustomed to. This dynamic behavior makes traditional rule-based IDS ineffective, as they struggle to detect novel or previously unseen domains.
At the same time, Federated Learning (FL) [5] has emerged as a promising approach to collaboratively train models across multiple decentralized devices without requiring raw data exchange. FL is particularly beneficial for IDS in distributed environments [6], such as edge computing or IoT networks, where data privacy and system scalability are crucial. However, FL introduces new challenges: non-Independent and Identically Distributed (non-IID) data distributions, where each client's data reflects different local patterns (for instance, one router mostly sees corporate traffic while another sees home-user traffic), and system heterogeneity, where clients vary in computational power and network conditions.
Traditional FL approaches, such as FedAvg [5], often assume that all participating clients contribute equally. In practice, selecting clients solely based on computational capability may lead to the exclusion of "stragglers" (slow or resource-constrained devices) [7], some of which may hold critical or rare data, such as indicators of new DGA attacks. Several FL frameworks attempt to mitigate non-IID data by clustering clients according to the similarity of their model parameters (e.g., weight vectors) [8,9,10,11], on the assumption that similar parameters imply similar data distributions. This assumption is fragile in practice: for parameter-based clustering to be accurate, clients must repeatedly train on their local data so that the model parameters come to reflect the characteristics of the data they possess, which incurs high communication and computational costs [12].
To overcome these challenges, we propose FedSAGE, a novel FL framework that integrates Variational Autoencoders (VAEs) for novelty-aware client clustering. In our approach, each client leverages a shared VAE to project its local data into a compressed latent space representation. These latent vectors capture the essential features of each client's data and serve as inputs to unsupervised clustering via affinity propagation, without requiring raw data sharing. This groups clients with similar data characteristics, especially in non-IID environments. Within each cluster, we apply a computation-aware selection strategy that chooses only the fastest clients for participation in training, ensuring both training efficiency and energy savings. Consequently, the system can detect different types of DGA attacks while avoiding the mistaken removal of clients that hold important attack-related data.
The main contributions of this work are summarized as follows:
  • We introduce lightweight novelty detection, leveraging VAE latent space representations for accurate, privacy-preserving unsupervised client clustering and anomaly detection.
  • We propose FedSAGE, a federated clustering and client selection framework that jointly addresses data heterogeneity and system heterogeneity.
  • We evaluate our proposed method on a realistic testbed with multi-zone DGA for non-IID datasets and demonstrate the accuracy improvements in detection, clustering quality without sharing local data, and energy efficiency.
These contributions form the foundation of the solution presented in the following sections.

2. Related Work and Motivation

This section reviews the use of FL in IDS systems and the limitations in DGA detection, which motivate our proposed solution.

2.1. Federated Learning

Federated Learning (FL) [5,7,13] is a decentralized machine learning paradigm where multiple clients (such as edge devices or distributed nodes) collaborate to train a shared model without exposing their raw data. FL is particularly well-suited for scenarios involving sensitive or distributed data, such as healthcare, finance, or network security [14]. FedAvg [5] is the foundational FL algorithm, in which clients perform local updates and send model parameters to a central server for aggregation. While it reduces communication cost and supports scalability, FedAvg suffers under non-Independent and Identically Distributed (non-IID) data distributions [7,13,14], where client data varies due to user behavior, geographic location, or deployment context; this can significantly hinder convergence and degrade global model performance.
To address the non-IID data distribution challenges in FL, many studies have proposed clustering-based solutions. Instead of aggregating model updates from all clients regardless of their data characteristics, the core idea is grouping clients with similar data distributions, which can reduce gradient conflicts, stabilize training, and improve convergence [8,9,15]. In essence, clustering acts as a regularization mechanism that mitigates the divergence caused by data heterogeneity across multiple client devices.
Based on the idea and effectiveness of clustering clients, several studies have been conducted. Tian et al. [8] developed WSCC, a clustering method based on weight similarity, which groups clients according to the similarity of their model characteristics after local training. This approach enhances learning stability under non-IID data conditions without requiring access to individual client data. K-FL [15] leverages a Kalman filter to dynamically estimate client similarity, reducing computation time and training variance. FedPVD [9] improves robustness by clustering clients according to the directional changes in their model parameters across rounds. IHC-FL [10] enhances performance by enabling intra-cluster update exchanges, thereby minimizing global communication overhead and local overfitting. Morafah et al. proposed FLIS [11], a clustered FL approach designed to mitigate the impact of non-IID data by grouping clients based on the similarity of their model outputs on a shared auxiliary dataset, allowing it to more accurately reflect the underlying data distribution.
Table 1 presents a comparison of FL methods that use client clustering techniques. Among them, weight-based approaches (WSCC, K-FL) are simple but require transmission of large models. FLIS effectively addresses strong non-IID scenarios but demands a shared auxiliary dataset, which conflicts with privacy objectives. FedSAGE, on the other hand, only transmits small hidden vectors and requires neither labels nor auxiliary data, making it lightweight in communication and better at protecting raw data.
Despite these advances, clustering based on model parameters has notable limitations. Model parameters do not always accurately reflect the underlying characteristics of the input data [16], especially when local training is minimal. To improve clustering precision, such approaches typically require multiple rounds of local training per client, increasing computation costs. Furthermore, transmitting large model parameters for clustering purposes adds significant communication overhead [12], which is especially problematic when working with large-scale models. Therefore, to develop a stable FL system under such complex conditions, it is essential to design a client clustering method that captures the underlying data distribution characteristics while ensuring strong privacy protection, high accuracy, and low communication cost.
Table 1. Client clustering methods in federated learning. (✓: required; ✗: not required).
Method | Needs Labels? | Needs Auxiliary or Shared Data? | Clustering/Similarity Signal | Extra Communication vs. FedAvg | Computational Overhead
WSCC [8] | ✗ | ✗ | Full model weights; cosine/ℓ2 distance | High (entire weight tensors transferred for pairwise comparison) | Low (simple distance calculation)
K-FL [15] | ✗ | ✗ | Kalman-filtered weight trajectories | Moderate | Low
FedPVD [9] | ✗ | ✗ | Direction of weight change (ΔW) across rounds | Moderate | Low
IHC-FL [10] | ✗ | ✗ | Weight similarity + intra-cluster model exchange | Moderate to high | Medium
FLIS [11] | ✓ (labels on aux-set) | ✓ (server keeps small labeled dataset) | Output logits on shared aux-data | Low (only logits) | High (server runs inference K times per round)
FedSAGE (this work) | ✗ | ✗ | Latent μ-vector from pre-trained VAE | Low (only a 1 × d vector) | Low

2.2. Federated Learning for Intrusion Detection

Intrusion Detection Systems (IDS) [1] are security solutions designed to monitor, analyze, and detect malicious activities or policy violations within a network or host system. As an essential component of cybersecurity, an IDS analyzes data packets as they travel across the network to detect potential threats before they reach critical systems. By monitoring for suspicious patterns, unauthorized access, or known attack signatures, IDS helps organizations prevent security breaches and maintain network integrity. Traditional IDS approaches, such as rule-based and signature-based systems, rely on predefined heuristics to identify known threats. While computationally efficient, these methods are ineffective against zero-day attacks and novel threat patterns [17,18].
To overcome these limitations, recent research has implemented Machine Learning (ML) into IDS [17,18,19,20], which enables the automatic identification of previously unseen attacks and complex behaviors. ML-based IDS can generalize from historical traffic patterns, making them more adaptive and scalable. However, most ML-based systems require centralized data collection, raising privacy concerns and scalability issues in distributed environments.
In the context of IDS, these systems are often distributed across routers, IoT gateways, or edge servers [6,21,22], where each node only observes a local slice of traffic. Raw logs (such as PCAP or DNS) often contain sensitive information. FL is naturally suited to this context because it enables training a global model without transferring raw logs. Specifically, each IDS node trains locally on its own traffic and only uploads model weights (updates) to the server [4,6,7].
For example, consider a smart home with an IP camera, a thermostat, and a smart TV. The camera primarily handles video streaming, the smart TV sends DNS queries to over-the-top services, while the thermostat connects periodically to a control server. These flows exhibit non-IID traffic patterns, yet with FL, all three devices can contribute to a shared “IDS brain” without transmitting images or raw packets. If the camera detects a suspicious domain like agsdjasa98.org (potentially DGA-related) that the smart TV does not, the global model can still learn this attack pattern and trigger alerts across the entire network.
This approach helps handle data heterogeneity across different networks by allowing models to be trained on site-specific traffic patterns while contributing to a broader, more robust intrusion detection framework. FL also reduces communication costs by transmitting only model parameters instead of raw network logs. As a result, the amount of communicated data does not scale proportionally with the size of the client’s local dataset, leading to relatively low bandwidth consumption. At the same time, FL enhances security through techniques such as secure computation, making it more resistant to model poisoning attacks and reducing the risks associated with data breaches. Moreover, deep learning-based IDS solutions can be integrated into FL to continuously learn from new attack patterns, swiftly adapt to emerging threat trends, and strengthen IDS defenses against network threats in real time, without requiring extensive human intervention.

2.3. Limitations of Existing DGA Detection Methods

Domain names play a critical role in the functionality of Internet access. A normal domain name (also called benign domain) is a human-readable address used to identify resources on the Internet, serving as an accessible alias for complex IP addresses. They are made by humans and follow standard naming conventions, making them readable and understandable (e.g., google.com, youtube.com).
On the other hand, Domain Generation Algorithms (DGAs) [2,3] automatically generate domain names and are often used by malware to produce an extensive quantity of random domains based on predefined rules. The primary purpose of DGAs is to enable malware, botnets, and other cyber threats to dynamically generate domain names for command-and-control (C&C) [23] communication and evasion of security measures. Attackers leverage C&C servers to remotely manage compromised devices. Through this channel, they are able to steal sensitive data, distribute additional malware, orchestrate botnets for Distributed Denial-of-Service (DDoS) attacks, monitor target systems, and maintain persistent unauthorized access to the victim's network [24]. In a typical cyber attack, malware or botnets require a reliable way to receive instructions from an attacker. However, using a fixed domain for C&C communication is risky for attackers, as the security system can easily blacklist or take down the domain and interrupt the attack. To overcome this, malware authors implement DGAs, which dynamically generate thousands of malicious domains (known as generative-domains) at regular intervals. The malware attempts to connect to these domains, and if the attacker registers one of them, the infected machine can establish a connection with the C&C server.
Detecting DGA-based threats presents several challenges. Rule-based approaches, such as those based on domain entropy or n-gram patterns [25,26], are limited in generalization and fail against adaptive DGAs. Classical ML methods like random forests and support vector machines [27] offer improved detection but depend on hand-crafted features and labeled data, making them susceptible to false positives and necessitating frequent retraining.
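For intuition, the entropy signal used by such rule-based detectors can be computed in a few lines of Python; the sketch below illustrates the general idea rather than any specific cited system, and the detection threshold is left unspecified.

```python
import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """Character-level Shannon entropy of the leftmost domain label.

    Random-looking DGA strings tend to score higher than dictionary words,
    which is why entropy is a classic (if easily evaded) detection signal.
    """
    label = domain.split(".")[0]
    if not label:
        return 0.0
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# e.g., shannon_entropy("google.com") ~ 1.92 < shannon_entropy("agsdjasa98.org") ~ 2.65
```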
Deep learning models have demonstrated strong potential in detecting DGA domains by learning sequential domain patterns directly from raw inputs. Techniques such as Convolutional Neural Networks (CNNs) [28,29] and Recurrent Neural Networks (RNNs) [30] achieve higher accuracy but introduce significant computational overhead and require large, labeled datasets, limiting their feasibility in resource-constrained or label-scarce environments.
In the context of FL, as illustrated in Figure 1, each IDS node is equipped with an AI operator capable of making predictions on received traffic, classifying observed domains as either benign or malicious. Each model is independently trained on local traffic data specific to its respective zone, eliminating the need for data sharing across zones. Subsequently, a Central Intelligence Server aggregates these locally trained models to produce a new global model. This process ensures that all zones are regularly updated and equipped with knowledge to detect emerging threats, such as new types of DGA attacks. While some studies have explored FL-based DGA detection [4], challenges remain regarding communication overhead, training convergence [7], and model generalization in non-IID settings. These issues motivate the development of lightweight, unsupervised, and data-aware approaches capable of detecting evolving DGA variants in federated IDS systems.

2.4. Research Gap and Motivation

Despite the promising advances in applying FL to IDS, several critical gaps remain unaddressed. First, most existing FL frameworks assume homogeneous or mildly non-IID data distributions and tend to overlook clients that operate on rare or unique data samples. This is particularly problematic in DGA detection, where new attack patterns often emerge in isolated environments and are difficult to detect without specialized data. Second, while prior works have explored client clustering techniques, often based on model parameters or inference similarity, these methods require substantial training efforts, rely on auxiliary datasets, and fail to accurately capture the intrinsic characteristics of raw input data. Third, client selection in FL is typically driven by computational efficiency, resulting in the exclusion of resource-constrained devices (“stragglers”). Unfortunately, these stragglers may possess crucial information about novel threats, and their exclusion may compromise the robustness of the global model.
In addition, traditional DGA detection models often require extensive labeled datasets and may not generalize well to previously unseen domain patterns. Given the difficulty of acquiring labeled DGA data in real-world IDS deployments, there is a pressing need for unsupervised methods that can effectively detect anomalies and support intelligent client selection.
To address these challenges, our work is motivated by the following core research questions:
  • (Q1) How can DGA attacks be effectively detected in a distributed environment while preserving data privacy?
  • (Q2) How can we mitigate non-IID data effects and ensure that even rare or novel DGA domains are learned by the global model?
  • (Q3) How can client selection be optimized to balance computational efficiency and data importance, especially under system heterogeneity?
  • (Q4) Is it feasible to perform novelty detection of DGA attacks in a label-free, unsupervised fashion?
These questions underpin the design of our proposed FedSAGE framework, which leverages latent representations from a pre-trained variational autoencoder to enable efficient client clustering and data-aware selection, ultimately improving both detection performance and system efficiency in FL-based IDS.
Traditional FL research mainly emphasizes convergence speed and data efficiency. In contrast, our work tackles a critical cybersecurity issue: how to collaboratively defend against adaptive DGA threats without relying on labeled datasets or centralized data sharing. This reflects the broader trend toward decentralizing threat intelligence while preserving operational privacy and resource constraints.
In summary, although various FL-based IDS frameworks and DGA detection methods have been proposed, they either incur high computational overhead, lack scalability to unseen attacks, or fail to integrate efficient client selection. These limitations drive the motivation for our work, which we detail in the next section.

3. The Proposed FedSAGE Framework for Federated DGA Detection

Building upon the challenges identified in Section 2, we propose FedSAGE (Federated Selection and Clustering for Anomaly and Generative-domain Evaluation), a novel framework that uniquely combines unsupervised latent space-based client clustering and computation-aware selection. Unlike prior works that rely on auxiliary datasets or parameter-based similarity, FedSAGE requires no labels and minimizes communication cost by leveraging the latent representations of a shared VAE. This section details the architectural and algorithmic design of our approach.

3.1. Variational Autoencoder Architecture

Autoencoders (AEs) [31] are neural network models widely used for unsupervised representation learning, especially in tasks that require dimensionality reduction or feature extraction. An AE learns to encode input data into a compressed latent vector and then reconstruct the original input from this compact representation. This self-supervised learning process makes it possible to capture essential patterns and structure in the original data. In clustering applications, the latent space produced by AEs often reflects meaningful groupings of data, making them particularly useful for discovering hidden structures or anomalies without supervision.
The Variational Autoencoder (VAE) [32] extends the AE with a probabilistic formulation, modeling the latent space as a distribution rather than a set of fixed points. Unlike standard AEs that map each input to a single latent vector, a VAE maps inputs to mean and variance vectors, enabling sampling from a learned distribution. This leads to a more continuous and regularized latent space, mapping similar inputs to nearby regions. The result is improved generalization, smooth interpolation between samples, and a latent space more suitable for clustering and anomaly detection.
As shown in Figure 2, the architecture of VAE consists of three primary components: encoder, probabilistic latent space, and decoder. The encoder, known as the probabilistic encoder, does not map input data to a fixed vector like in the AE formulation but instead estimates a probability distribution over the latent space. This change introduces uncertainty, which enhances the model’s ability to generate diverse yet coherent samples.
The latent space in a VAE is designed as a continuous, probabilistic space rather than a deterministic bottleneck, distinguishing it from traditional autoencoders. Let a domain name x be encoded to obtain a latent mean $\mu(x)$ and standard deviation $\sigma(x)$. The VAE employs the reparameterization trick, which allows gradients to flow through the stochastic sampling process by re-expressing the sampled latent variable z as shown in Equation (1):

$$z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (1)$$
Equation (1) ensures that the latent variable remains differentiable, thereby allowing for efficient gradient-based optimization. Among the variables $z$, $\mu$, $\sigma$, and $\epsilon$, the latent mean $\mu$ is the only one that directly captures the characteristics of the input data within the latent space. It is therefore well-suited for use in unsupervised clustering algorithms [33,34] to determine the underlying type or distribution of the data. This makes $\mu$ particularly effective for tasks such as classifying DGA attack data and detecting previously unseen DGA variants in a fully unsupervised manner.
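As a concrete illustration, Equation (1) amounts to a few lines of PyTorch; parameterizing via the log-variance, as is conventional, is an implementation assumption rather than something stated in the text.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), as in Equation (1).

    Parameterizing sigma via the log-variance keeps it positive and stable.
    """
    std = torch.exp(0.5 * logvar)   # sigma(x)
    eps = torch.randn_like(std)     # eps ~ N(0, I)
    return mu + std * eps           # differentiable w.r.t. mu and logvar
```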
In this system, the VAE is first pre-trained exclusively on benign domain data. The objective is to construct a latent space that encapsulates the statistical and structural characteristics of legitimate domain names, such as typical domain lengths, entropy distributions, character composition, and morphological patterns (e.g., common suffixes like .com, .org). During training, the VAE minimizes a loss that typically combines mean squared error (MSE) for reconstruction fidelity with Kullback–Leibler (KL) divergence for latent regularization, allowing the model to accurately reproduce benign inputs from their latent representations.
Let $X_{\text{benign}} = \{x_1, x_2, \ldots, x_N\}$ denote a dataset of legitimate (benign) domain names. A VAE is trained on $X_{\text{benign}}$ to learn a probabilistic generative model $p_\theta(x|z)$, where z is a latent variable sampled from a prior distribution $p(z) = \mathcal{N}(0, I)$. The encoder $q_\phi(z|x)$ approximates the true posterior and maps each input x into a distribution in latent space.
The VAE is optimized by maximizing the following evidence lower bound (ELBO), or equivalently minimizing its negative:

$$\mathcal{L}_{\text{VAE}}(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{\text{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big), \qquad (2)$$

where:
  • The first term encourages accurate reconstruction of the input x from the latent variable z in Equation (1).
  • The second term regularizes the encoder's output to remain close to the prior $p(z)$.
In our implementation, as shown in Figure 2, the VAE model is realized using a character-level BiLSTM encoder and decoder, with the reconstruction loss computed via token-level cross-entropy (rather than MSE). The KL divergence term is scaled by batch size and added to the total loss as in Equation (2).
During training, only domain names labeled as benign (i.e., $y = 0$) are used to construct the latent space. The VAE is trained end-to-end using the Adam optimizer, and the final reconstruction error $E_{\text{rec}}(x)$ is computed per domain to support unsupervised anomaly detection.
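A minimal sketch of this objective under the stated choices (token-level cross-entropy plus KL divergence, scaled by batch size) might look as follows; the padding index and reduction details are assumptions, and the authors' exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def vae_loss(logits, targets, mu, logvar, pad_idx: int = 0):
    """Negative ELBO of Equation (2): token-level cross-entropy plus KL.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) character indices of the input domain
    mu, logvar: (batch, d) parameters of q(z|x)
    """
    # Reconstruction term: cross-entropy over characters, ignoring padding.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
        reduction="sum",
    )
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kld) / targets.size(0)  # scaled by batch size
```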
Once trained, the encoder captures meaningful embedding of benign domains, while the decoder effectively reconstructs these inputs. When a new domain name, possibly generated by a DGA, is processed by the VAE, it is projected into the latent space and passed through the decoder. If the input domain differs significantly from the distribution learned during benign training, the reconstruction error (i.e., the difference between the input and the output) increases notably.
In practical scenarios, this approach offers several advantages:
1. The model can detect malicious domains that have not been previously encountered, without requiring explicit DGA labels.
2. The pre-trained encoder can be deployed on lightweight client devices, where it performs simple forward passes to generate domain embeddings, eliminating the need for on-device retraining.
3. It significantly reduces computational and communication overhead in FL environments, particularly at the edge or in IoT settings.

3.2. Unsupervised VAE Clustering for Novelty Detection

3.2.1. Client Clustering Based on Latent Encoding

Building on the latent vector $\mu$ extracted from the pre-trained VAE, we apply an unsupervised client clustering approach, as illustrated in Figure 3. Before each local training round on client i, a VAE model shared from the server is made available. The local training data on client i is passed through the VAE to obtain latent vectors $\mu$. These latent vectors are then averaged into a representative value $\bar{\mu}_i$, which characterizes the data distribution of client i. Once the $\bar{\mu}_i$ values from all clients are collected, a clustering algorithm is applied to group clients that share similar data characteristics.
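A sketch of this per-client encoding step is shown below, assuming the shared encoder returns a (μ, log σ²) pair for each batch; the function name and data-loader interface are hypothetical.

```python
import torch

@torch.no_grad()
def client_signature(encoder, loader):
    """Average the latent means over a client's local data to obtain mu_bar.

    `encoder` is assumed to return a (mu, logvar) pair for a batch of
    encoded domain names; `loader` iterates over the local dataset.
    """
    encoder.eval()
    total, count = None, 0
    for batch in loader:
        mu, _ = encoder(batch)                  # (batch_size, d)
        s = mu.sum(dim=0)
        total = s if total is None else total + s
        count += mu.size(0)
    return total / count                        # mu_bar: one 1 x d vector
```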
In FedSAGE, we adopt Affinity Propagation (AP) [35] as the client clustering method due to its unique advantages in handling latent representations of non-IID data. Unlike traditional clustering algorithms such as K-means, which require the number of clusters to be predefined and assume spherical distributions, AP operates in a fully unsupervised manner and automatically determines the number of clusters based on pairwise similarity scores.
AP operates on a similarity matrix $S \in \mathbb{R}^{N \times N}$, where $S(i,k)$ indicates the similarity between data points i and k. In our context, each data point corresponds to a client's mean latent vector $\bar{\mu}_i$ extracted from the VAE, and the similarity is defined as the negative squared Euclidean distance:

$$S(i,k) = -\,\lVert \bar{\mu}_i - \bar{\mu}_k \rVert^2 \qquad (3)$$
The algorithm iteratively updates two types of messages between data points:
  • Responsibility $r(i,k)$: how well-suited point k is to serve as the exemplar for point i;
  • Availability $a(i,k)$: how appropriate it would be for point i to choose point k as its exemplar.
These messages are updated using the following rules:

$$r(i,k) \leftarrow S(i,k) - \max_{k' \neq k} \big\{ a(i,k') + S(i,k') \big\} \qquad (4)$$

$$a(i,k) \leftarrow \min \Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Big\} \qquad (5)$$

The self-message sum $r(k,k) + a(k,k)$ reflects the accumulated evidence that point k should be an exemplar. After convergence, each point i is assigned to the exemplar k that maximizes the combined value $a(i,k) + r(i,k)$.
In our FedSAGE framework, Equations (4) and (5) allow clients with similar latent features to naturally form clusters without predefining the number of groups, while clients most central to each cluster (exemplars) serve as suitable representatives for training. This makes it ideal for dynamic FL settings, where data distributions are unknown and continuously changing. AP clusters clients by iteratively exchanging real-valued messages between data points until convergence, effectively identifying exemplars that represent the most central members of each cluster. These exemplars are especially valuable in FL, as they can act as representative clients for both training and analysis.
Furthermore, AP is well-suited for high-dimensional latent spaces generated by the VAE, where client data may not be linearly separable or uniformly distributed. It avoids reliance on explicit distance thresholds or centroid initialization, making it more robust and adaptive to the statistical variability present in federated IDS systems.
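In practice, this step maps directly onto scikit-learn's AffinityPropagation, whose default euclidean affinity implements the negative squared distance of Equation (3); the sketch below assumes `signatures` is the list of per-client $\bar{\mu}$ vectors produced in the previous step.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# `signatures` is assumed to be a list of per-client mean latent vectors
# (numpy arrays of shape (d,)), one per client, from the encoding step.
mu_bars = np.stack(signatures)               # shape (N, d)

# The default 'euclidean' affinity scores pairs by the negative squared
# distance of Equation (3); the number of clusters is inferred automatically.
ap = AffinityPropagation(random_state=0).fit(mu_bars)
cluster_of = ap.labels_                      # cluster index for each client
exemplars = ap.cluster_centers_indices_      # indices of exemplar clients
```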
FedSAGE also performs novelty detection for DGA attacks through distance-based analysis within the latent space: new or unknown samples are encoded into the same latent space, and their distances to existing cluster centroids or distributions are calculated. Samples that lie far from all known clusters are identified as potential novel threats, enabling effective anomaly detection based on the learned representations.
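A minimal version of this distance-based check is sketched below; the paper does not prescribe a specific threshold, so choosing one (e.g., from validation data) is left as an assumption.

```python
import numpy as np

def novelty_score(mu_new: np.ndarray, centroids: np.ndarray) -> float:
    """Distance from a new latent vector to the nearest known cluster centroid.

    centroids: (K, d) array of cluster centers in the VAE latent space.
    Scores above a chosen threshold flag the sample as a potential novel
    (e.g., previously unseen DGA) domain.
    """
    return float(np.linalg.norm(centroids - mu_new, axis=1).min())
```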

3.2.2. Privacy-Preserving Latent Encoding

Although FedSAGE does not transmit raw inputs, one might ask whether the transmitted VAE latent vectors $\mu$ of each client (as shown in Figure 3) could still be reverse-engineered to recover sensitive domain names. In FedSAGE, the VAE inherently ensures the non-reversibility of the latent representation for the following reasons:
1. The encoder maps input domain strings $x \in X$ to a low-dimensional latent mean vector $\mu(x) \in \mathbb{R}^d$, typically with $d \ll |x|$. This process is non-invertible by design, as multiple inputs are projected to nearby or overlapping regions in latent space. The dimensionality reduction forces the model to compress only the most semantically relevant aspects of the input. This is aligned with the information bottleneck principle, which suggests that $\mu$ retains information that is useful for downstream tasks (e.g., clustering) while discarding instance-specific noise or details that might enable reconstruction.
2. During training, the VAE introduces Gaussian noise into the encoding through the reparameterization trick in Equation (1). Even after training, the encoder focuses on capturing latent factors of variation, not full input recovery. Therefore, even if one were to recover a valid latent point $\mu$, it is impossible to deterministically recover x, especially for non-deterministic inputs such as randomized domain names.
3. The decoder network used during training is not deployed or shared in the federated setup. The server only observes isolated latent vectors from independent clients. Even with white-box access, reconstructing a valid decoder without aligned ($\mu$, x) pairs from each client is infeasible due to a lack of supervision and data pairing.
Moreover, prior work in representation learning (e.g., Triastcyn et al. [36]) supports the claim that VAE latent spaces are robust to inversion, particularly when regularized with KL-divergence and trained without reconstruction fidelity as the primary goal. Our encoder is trained with minimal capacity to discourage memorization, further reducing the risk of inversion.
In summary, the latent vector μ serves as a semantic signature, not a reconstructible representation. While we acknowledge that empirical verification (e.g., decoder attack) can further reinforce this claim, our current use of lossy, many-to-one, stochastic encoding with no decoder availability provides a strong theoretical foundation for its privacy-preserving nature.

3.3. Client Selection with Data-Aware Scheduling

In practical FL deployments, clients exhibit substantial heterogeneity in terms of computation power, energy availability, and network bandwidth. This results in the well-known straggler problem, where slow clients delay aggregation and prolong convergence. To mitigate this, many existing methods adopt a computation-centric client selection policy, wherein only clients with high processing speed or sufficient resources are included in each training round. While this improves training efficiency, it introduces the risk of excluding clients that hold rare or critical data, a significant issue in intrusion detection scenarios where novel threats, such as DGA-based attacks, may only appear in specific regions or under certain traffic patterns.
FedSAGE addresses this challenge by coupling data-aware clustering with efficient client scheduling. After deriving latent vectors ($\mu$) from the pre-trained VAE, clients are grouped into clusters based on the semantic similarity of their local data. Within each cluster, only the fastest clients are selected to participate in the training round. Formally, let $C_1, \ldots, C_K$ denote the K clusters. For each client cluster $C_k$, we select the top $\rho\%$ fastest clients based on estimated execution time. The selected subset $S_k$ is defined as follows:

$$S_k = \big\{ c \in C_k \;\big|\; \mathrm{Rank}_{C_k}\big(\mathrm{ExecTime}(c)\big) \le \rho \cdot n_k \big\} \qquad (6)$$

where $\mathrm{ExecTime}(c)$ denotes the expected local training time for client c, estimated from prior communication rounds or resource profiles, $n_k = |C_k|$, and the rank is computed in ascending order of execution time.
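Equation (6) reduces to a simple per-cluster ranking. The sketch below assumes plain dictionaries for cluster membership and execution-time estimates, and keeps at least one client per cluster as a guard against empty selections.

```python
def select_fastest(clusters, exec_time, rho=0.3):
    """Select the top rho fraction of fastest clients per cluster (Equation (6)).

    clusters:  dict mapping cluster id -> list of client ids
    exec_time: dict mapping client id -> estimated local training time (s)
    """
    selected = []
    for members in clusters.values():
        quota = max(1, int(rho * len(members)))      # rank <= rho * n_k, >= 1 guard
        ranked = sorted(members, key=lambda c: exec_time[c])
        selected.extend(ranked[:quota])
    return selected
```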
This strategy ensures that each latent data region is represented in every training round while minimizing the overall training delay. Unlike computation-only selection strategies, FedSAGE maintains semantic diversity across clients and avoids the blind spots caused by excluding slower but potentially more informative nodes. Selecting several clients per cluster also introduces fault tolerance, ensuring that the failure or dropout of a single client does not compromise training progression. Furthermore, FedSAGE reduces communication overhead, making it suitable for deployment in resource-constrained environments such as edge-based IDS systems.

3.4. FedSAGE—Federated Selection and Clustering for Anomaly and Generative-Domain Evaluation

After clustering the devices as described in Section 3.2 to address data heterogeneity caused by non-IID distributions and simultaneously mitigating system heterogeneity by filtering out straggler devices in Section 3.3, we can define the complete operational workflow of the FedSAGE algorithm, which comprises three core components:
  • Representation Learning: Using a shared pre-trained VAE to encode client data into a meaningful latent space.
  • Client Clustering: Grouping clients via affinity propagation based on latent representations to capture data similarity.
  • Efficient Selection: Selecting the fastest client(s) within each cluster to ensure training efficiency without sacrificing data diversity.
The proposed FedSAGE algorithm, shown in Algorithm 1 and visualized in Figure 4, is designed to address both non-IID data and system heterogeneity in FL environments by combining data-aware clustering with computation-aware client selection. At the beginning of each training round, every client uses a shared VAE to map its local data into a latent representation. On each client i, latent vectors ($\mu$) are computed for all local data points and averaged to form the mean latent vector $\bar{\mu}_i$, which captures the overall characteristics of its data. These latent values are then clustered using the affinity propagation algorithm, grouping clients with similar data distributions into clusters $C_1, C_2, \ldots, C_k$.
Algorithm 1 FedSAGE: Federated Selection and Clustering for Anomaly and Generative-domain Evaluation
1: Input: Total clients N, selection ratio $\rho$ (%), total training rounds T
2: Initialize global model parameters w
3: for each round t = 1 to T do
4:    // Step 1: Encode local data into latent vectors
5:    for each client $i \in \{1, \ldots, N\}$ do
6:       Use the shared VAE to compute latent vectors $\mu_i$ from local data
7:       Compute the mean latent vector $\bar{\mu}_i$ for client i
8:    end for
9:    // Step 2: Cluster clients based on $\bar{\mu}_i$
10:   Perform Affinity Propagation clustering on $\{\bar{\mu}_i\}_{i=1}^{N}$
11:   Let $C_1, C_2, \ldots, C_k$ be the resulting client clusters
12:   // Step 3: Select the $\rho\%$ fastest clients in each cluster
13:   for each cluster $C_j$ do
14:      Estimate the training time of each client
15:      Select the top $\rho\%$ fastest clients from $C_j$ as $S_j$ (from Equation (6))
16:   end for
17:   // Step 4: Local training
18:   Form the selected set $S = \bigcup_{j=1}^{k} S_j$
19:   for each client $i \in S$ in parallel do
20:      Perform local training on the client's dataset
21:      Send updated model parameters to the server
22:   end for
23:   // Step 5: Aggregation
24:   Server aggregates updates and updates the global model w
25: end for
Within each cluster, the computation speed of every client is measured. The top $\rho\%$ fastest clients in each cluster are selected to participate in the current training round, forming a selected set $S = \bigcup_{j=1}^{k} S_j$. These clients perform local training using their private datasets and send the updated model parameters back to the central server. The server then aggregates the received updates to refine the global model w.
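The aggregation in Step 5 follows FedAvg-style weighted averaging; below is a sketch under the assumption of PyTorch state dicts weighted by local sample counts, which is the standard FedAvg rule rather than a detail specified in the text.

```python
import torch

def aggregate(state_dicts, num_samples):
    """FedAvg-style weighted average of client model parameters.

    state_dicts: list of model.state_dict() returned by the selected clients
    num_samples: list of local dataset sizes, used as aggregation weights
    """
    total = float(sum(num_samples))
    global_state = {}
    for key in state_dicts[0]:
        # Weighted sum over clients; .float() also handles integer buffers.
        global_state[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(state_dicts, num_samples)
        )
    return global_state  # load into the global model with load_state_dict
```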
By incorporating both data representation and device capability into the selection process, FedSAGE ensures that important data, such as traces of new DGA attacks, is not excluded due to limited device performance while also improving training efficiency and energy savings. This makes it a suitable strategy for intelligent and sustainable FL systems.
In real-world federated environments, client reliability and stability are often nondeterministic. While our method assumes honest but partially active clients, we implement simple yet effective mechanisms to handle common failure modes. Since clustering and model aggregation are performed only on a selected subset of clients in each round, the failure of a few clients reduces the participation set. The server proceeds with aggregation using the successfully returned latent vectors, as done in standard FL.

4. Experiment Setup

4.1. DGA Datasets and Detection Models

For the preparation of training AI models, we collected domain name datasets and organized them into two distinct categories. The first consists of benign domains sourced from the Cisco Umbrella Top 1 Million list (http://s3-us-west-1.amazonaws.com/umbrella-static/index.html, accessed on 8 March 2025), which contains the most frequently accessed and widely used domains on the Internet. The second consists of DGA domains, i.e., malicious domains gathered from open-source GitHub repositories (https://github.com/baderj/domain_generation_algorithms, tag version 1.0.0, accessed on 22 December 2024), covering 20 different DGA families with approximately 50,000 domain names per family. We randomly divided the dataset into four separate zones to simulate real-world non-IID scenarios, where each IDS node may be exposed to different DGA families. The detailed distribution of domain types across these zones is summarized in Table 2.
To effectively detect domain names generated by DGA, we implemented and evaluated three deep learning models: CNN [29], BiLSTM [37], and Transformer [38]. Each model is designed to capture different aspects of the sequential patterns inherent in domain names:
  • CNN Model: Built on a multi-branch convolutional architecture to capture local n-gram patterns in domain names. It employs an embedding layer followed by four parallel 1D convolutional layers with kernel sizes of {3, 5, 7, 9}, each with 128 filters. These outputs are pooled via adaptive max pooling and concatenated before passing through fully connected layers (a code sketch follows this list).
  • BiLSTM Model: This uses a bidirectional LSTM encoder with a hidden dimension of 64 in each direction (128 total), followed by an attention mechanism that computes a weighted sum of hidden states. The attention-enhanced representation is then passed through batch-normalized dense layers for classification. This architecture is effective for modeling sequential correlations inherent in algorithmically generated domain names.
  • Transformer Model: This model uses global attention for modeling input sequences, enabling parallel computation and strong generalization. It uses an embedding layer of size 64 with added positional encoding, followed by a stack of 4 Transformer encoder layers with 4 heads and a feedforward dimension of 256. The final output is taken from the first token's transformed representation, similar to BERT-style classification [39]. This model is the most computationally intensive among the three, but it demonstrates strong performance in capturing long-range semantic patterns.
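As referenced in the CNN item above, the multi-branch architecture can be sketched as follows; the vocabulary size, embedding width, and classifier head are illustrative assumptions, while the four kernel sizes and 128 filters per branch follow the description.

```python
import torch
import torch.nn as nn

class DGACnn(nn.Module):
    """Parallel 1D-conv classifier over character sequences of domain names."""

    def __init__(self, vocab_size=40, embed_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, 128, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7, 9)                   # four parallel branches
        ])
        self.pool = nn.AdaptiveMaxPool1d(1)         # max over the sequence
        self.fc = nn.Sequential(
            nn.Linear(4 * 128, 128), nn.ReLU(), nn.Linear(128, num_classes)
        )

    def forward(self, x):                           # x: (batch, seq_len) ints
        h = self.embed(x).transpose(1, 2)           # (batch, embed, seq_len)
        feats = [self.pool(torch.relu(c(h))).squeeze(-1) for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))     # class logits
```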
These three models provide complementary strengths: CNNs are efficient and focused on local patterns; BiLSTMs capture temporal dynamics; and Transformers offer strong global attention capabilities, allowing us to rigorously evaluate the robustness of FedSAGE under different neural representations. Collectively, they allow for robust and comprehensive detection of a wide variety of DGA-based threats. However, because their neural architectures differ fundamentally, training speed varies significantly across models: each has distinct computational complexity and convergence behavior, leading to different training times even under the same system conditions.

4.2. Testbed Setup

Figure 5 illustrates the operational architecture of our FedSAGE framework deployed in a federated learning environment, with two phases: clustering and selection (green lines and components) and aggregation (blue lines and components). Each client (e.g., a virtualized IDS node) holds private domain data and performs a local forward pass through a shared pre-trained VAE to generate a latent representation vector $\mu_i$.
These latent vectors $\{\mu_1, \mu_2, \ldots, \mu_n\}$ are transmitted to the central server, which performs affinity propagation clustering to group clients with similar data distributions. As argued in Section 3.2.2, these latent representations are non-invertible and thus preserve data privacy. Within each cluster, clients are selected for local training based on their estimated execution time profiles. Selected clients then perform local model updates and return the trained weights to the server. Clients that are not selected (shown with dashed lines) skip the local training in that round.
With the blue flow, the server aggregates the received model parameters to update the global model, which is broadcast to all clients for the next round. This iterative process ensures both data diversity and training efficiency while minimizing communication and computation overhead in heterogeneous environments.
The details of the architecture of the VAE model are as follows. The encoder begins with an embedding layer that maps input tokens (representing characters of a domain name) into 16-dimensional vectors, then passes through a bidirectional LSTM layer with 64 hidden units in each direction. The final hidden states from both directions are concatenated into a 128-dimensional vector and processed through two fully connected layers to produce the mean and log-variance vectors, which parameterize the latent space. A latent vector is then sampled from this distribution using the standard reparameterization approach.

The decoder reconstructs the input domain sequence from the sampled latent vector. It first projects the latent vector into an initial hidden state for a unidirectional LSTM decoder. This LSTM receives a sequence of start tokens (zeros) and generates an output sequence with the same length as the input. Each output step predicts a token using a fully connected layer over the vocabulary space.

To deploy the three deep learning models (CNN, BiLSTM, and Transformer) mentioned in Section 4.1 and the VAE model in the actual testbed, we use the PyTorch framework with the Python programming language. The experimental setup includes the system configuration summarized in Table 3.
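Putting the encoder and decoder details above together, a hedged PyTorch sketch of the VAE might look as follows; the vocabulary size and latent dimensionality are assumptions, and the authors' exact code may differ.

```python
import torch
import torch.nn as nn

class DomainVAE(nn.Module):
    """Character-level VAE following the description above."""

    def __init__(self, vocab_size=40, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)          # 16-dim char embeddings
        self.enc = nn.LSTM(16, 64, batch_first=True, bidirectional=True)
        self.fc_mu = nn.Linear(128, latent_dim)            # mean head
        self.fc_logvar = nn.Linear(128, latent_dim)        # log-variance head
        self.z_to_h = nn.Linear(latent_dim, 64)            # initial decoder state
        self.dec = nn.LSTM(16, 64, batch_first=True)
        self.out = nn.Linear(64, vocab_size)               # per-step vocab logits

    def encode(self, x):
        _, (h, _) = self.enc(self.embed(x))                # h: (2, batch, 64)
        h = torch.cat([h[0], h[1]], dim=1)                 # (batch, 128)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, x):                                  # x: (batch, seq_len) ints
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # Equation (1)
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)       # (1, batch, 64)
        c0 = torch.zeros_like(h0)
        start = torch.zeros(x.size(0), x.size(1), 16, device=x.device)  # zero tokens
        dec_out, _ = self.dec(start, (h0, c0))             # same length as input
        return self.out(dec_out), mu, logvar
```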
For energy modeling, we approximate the per-round energy of client i as follows:

$$E_i = P_{\text{idle}} \, t_{\text{idle}} + \sum_{c} P_c \, t_c + P_{\text{tx}} \, \frac{B_i}{R_i}. \qquad (7)$$
Equation (7) breaks the total energy consumed by client i in a single FL round into three intuitive components:
  • Idle energy, $P_{\text{idle}} t_{\text{idle}}$: even when no training is underway, the device draws a baseline (static) power $P_{\text{idle}}$ to keep fans, regulators, and memory alive. Multiplying this idle power by the time the device remains idle, $t_{\text{idle}}$, yields the unavoidable "background" energy budget.
  • Computation energy, $\sum_c P_c t_c$: during training, each compute resource c (CPU and GPU) operates at an average dynamic power $P_c$; multiplying by its active time $t_c$ gives the energy devoted purely to model computation.
  • Communication energy, $P_{\text{tx}} B_i / R_i$: exchanging model parameters and gradients forces the network interface to transmit $B_i$ bytes at a data rate of $R_i$. The term $B_i / R_i$ is the transmission time; multiplied by the NIC's transmit power $P_{\text{tx}}$, it captures the energy spent on networking.
Because all power coefficients ($P_{\text{idle}}$, $P_c$, $P_{\text{tx}}$) are pre-profiled with the powertop utility as a software power meter and remain nearly constant during the experiments, Equation (7) offers a fast, sensor-free way to predict per-round energy from easily observable variables (execution times and payload size). The scheduler can therefore optimize client selection simultaneously for latency and energy.
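Equation (7) translates directly into a small helper; the argument names below are illustrative, with the power coefficients standing in for the pre-profiled constants described above.

```python
def round_energy(p_idle, t_idle, p_comp, t_comp, p_tx, bytes_tx, rate):
    """Per-round client energy from Equation (7), in Joules.

    p_comp/t_comp map each compute resource (e.g., 'cpu', 'gpu') to its
    average dynamic power (W) and active time (s); bytes_tx / rate is the
    transmission time for the model-parameter payload.
    """
    e_idle = p_idle * t_idle                              # baseline draw
    e_comp = sum(p_comp[c] * t_comp[c] for c in p_comp)   # CPU + GPU work
    e_comm = p_tx * (bytes_tx / rate)                     # NIC transmission
    return e_idle + e_comp + e_comm
```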

4.3. Baseline

To comprehensively evaluate the effectiveness of our proposed FedSAGE framework, we compare it against the following baseline FL strategies:
  • FedAvg [5]: The standard FL algorithm in which all clients participate in each training round. Clients perform local training using their private data and send the updated model parameters to a central server, which aggregates them using weighted averaging. FedAvg does not account for system heterogeneity or data distribution skewness.
  • FedRand (Random Selection): A strategy where a fixed number of clients is randomly selected in each round. This method reduces the training time by avoiding straggler clients but may overlook important data held by randomly excluded clients, especially under non-IID conditions.
  • FedSpeed (Only Selection): This method selects only clients with high computational performance (e.g., faster devices) in each training round. While this improves system efficiency, it may exclude slower devices that possess critical or unique data, potentially hurting model generalization in non-IID settings.
  • FLIS (Federated Learning via Inference Similarity) [11]: A clustered FL approach that groups clients based on inference similarity (i.e., the similarity of model outputs on a shared auxiliary dataset). Although FLIS effectively addresses data heterogeneity, it introduces additional overhead by requiring an auxiliary dataset for server-side evaluation and similarity computation.
For the selection-based algorithms, devices are selected with a selection ratio of $\rho = 30\%$. These strategies serve as comparative baselines to highlight how FedSAGE balances both training efficiency and data relevance through intelligent client clustering and selection.
While our evaluation focuses on FL methods, we acknowledge that the frameworks we consider are general-purpose FL systems. Existing studies that explore FL in conjunction with domain-specific applications, such as DGA detection or IDS [4,6,7,21,22], primarily provide broad surveys or preliminary insights into the applicability of FL in security contexts. However, these works do not present strong contributions in key technical aspects that are central to our investigation, namely convergence speed optimization, intelligent client selection, and energy-aware system design. They lack systematic experimentation or in-depth evaluation along these axes. For this reason, we do not consider these studies as meaningful baselines for our work. Including them would dilute the focus of our evaluation without offering a fair or technically relevant point of comparison.

5. Results

5.1. Clustering Efficiency

In Section 3.2, the local data on each client is passed through a shared VAE model to extract meaningful latent representations. These representations are then used as inputs to an unsupervised clustering algorithm to group clients based on data similarity. To evaluate the effectiveness of this clustering process, Figure 6 presents the clustering results obtained using the affinity propagation algorithm, visualized in two-dimensional space via Principal Component Analysis (PCA) for dimensionality reduction.
As shown in Figure 6, clients are clearly separated into four clusters, reflecting the original division of the dataset into four zones, as described in Table 2. Furthermore, the final clustering outcome corresponds exactly to the actual distribution of each client's local private data: clients whose data belong to a specific zone are accurately grouped into the same cluster without any prior knowledge of the number of clusters or DGA attack types.
This result demonstrates that the proposed client clustering method based on VAE latent vectors achieves high accuracy in reflecting the underlying data characteristics in a fully unsupervised manner. The approach is particularly practical for real-world scenarios, enabling IDS nodes to detect and adapt to previously unseen DGA attacks without the need for labeled data or predefined DGA family information.

5.2. Distributed Training Performance

The experimental results shown in Table 4 and Table 5 present a comparison of Top-1 test accuracy achieved by different FL algorithms under different training durations. This demonstrates that FedSAGE consistently outperforms all baseline methods in terms of both accuracy and convergence speed across different deep learning models and data distribution settings.
Under uniform data with the CNN model, FedSAGE reaches the highest accuracy of 87.53% within 800 s, significantly outperforming FedAvg (76.69%) and FedSpeed (65.26%). In the non-uniform scenario, FedSAGE again achieves the best result, with 87.85%, showing superior robustness to non-IID data compared to other methods such as FedAvg (77.85%) and FedRand (78.48%).
With the BiLSTM model, FedSAGE attains its peak accuracy of 89.33% in just 500 s for uniform data, while other methods such as FedAvg and FedSpeed achieve lower accuracy within the same timeframe. In the non-uniform case, FedSAGE further improves to 89.88%, demonstrating its ability to quickly converge while maintaining high performance even with heterogeneous data.
For the Transformer model, which generally requires longer training times, FedSAGE still achieves the highest accuracy in both settings. It reaches 76.09% under uniform and 80.32% under non-uniform data after 1500 s, outperforming all other approaches, including FedSpeed and FLIS.
Overall, these results confirm that FedSAGE not only accelerates the training process but also achieves superior accuracy, especially in realistic non-IID scenarios. Its ability to effectively cluster clients and select computationally suitable participants allows it to strike an optimal balance between efficiency and performance.

5.3. Client Performance Analysis

The deployment of FedSAGE requires each client to locally perform inference using a pre-trained VAE in order to extract mean latent vectors μ i ¯ . This introduces additional computational overhead compared to conventional FL frameworks, as clients must execute an extra forward pass through the VAE. To assess the practicality of this design, especially in edge environments with limited resources, we evaluate the lightweight nature of the VAE component in terms of both runtime and memory consumption. Specifically, we compare the VAE’s inference time and peak RAM usage against the baseline DNN models used for DGA classification (CNN, BiLSTM, and Transformer), across clients with 2, 4, 6, and 8 virtual CPU cores. This analysis demonstrates that the integration of the VAE imposes minimal burden on client devices, ensuring that FedSAGE remains compatible with typical federated learning infrastructures.
As shown in Figure 7, clients with fewer cores experience significantly higher execution times. For example, training the Transformer model takes over 23 s on a 2-core VM, while it only takes 11.48 s on an 8-core VM. BiLSTM and CNN exhibit similar trends. In contrast, inference with the VAE remains lightweight across all configurations, with average execution time below 1 s, even on the lowest-resource devices. This confirms that latent extraction can be executed in near real time without affecting system responsiveness. Furthermore, since clustering and selection are performed only once per training round, not per sample, the system supports batch-based operation and asynchronous training, making FedSAGE practical for real-world, low-latency federated inference pipelines, especially for edge and IoT devices.
These results validate the effectiveness of FedSAGE’s client selection strategy outlined in Section 3.3. By selecting the top ρ % fastest clients in each data cluster, the framework effectively avoids performance degradation due to stragglers. This approach ensures both training efficiency and fair representation of heterogeneous data sources.
In addition to execution time, we evaluated the memory consumption of each model during local training and inference. Table 6 reports the average RAM usage (in GB) measured on client devices for the three deep learning models and the VAE.
The results show that all of the models maintained relatively low memory footprints, with Transformer being the most memory-intensive at 0.68 GB, followed closely by BiLSTM and CNN. The VAE, used for latent vector extraction and clustering, had the smallest memory demand at only 0.10 GB, making it well-suited for deployment on edge devices with limited resources.
These findings further support the practical applicability of FedSAGE in resource-constrained environments, where both compute time and memory usage must be considered in client selection and system design. This is well suited for the system architecture built with VAE, as shown in Section 3.2.

5.4. Energy Efficiency

Table 7 shows the energy consumption (in Joules) required by each FL algorithm to reach the 80% accuracy threshold, across the different deep learning models and data distributions. A "-" entry indicates that the method failed to reach the threshold within the maximum of 400 training rounds we conducted. Overall, the results demonstrate that FedSAGE consistently consumes the least energy, highlighting its efficiency in both training and resource utilization.
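For clarity, the threshold-based accounting behind Table 7 can be expressed as a simple loop. The sketch below is illustrative only: it assumes per-round logs of global accuracy and per-client busy time, and approximates energy as an average power draw multiplied by busy time, which is an assumption for exposition rather than our exact metering setup.

    def energy_to_threshold(round_logs, power_w, threshold=0.80):
        # Accumulate estimated energy (J = W x s) round by round until the
        # global model first reaches the accuracy threshold.
        total_j = 0.0
        for log in round_logs:  # log = {"acc": float, "busy_s": {client: sec}}
            total_j += sum(power_w[c] * s for c, s in log["busy_s"].items())
            if log["acc"] >= threshold:
                return total_j
        return None  # corresponds to the "-" entries in Table 7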
For the CNN model under uniform data, FedSAGE reaches 80% accuracy with just 16,812.44 J, significantly outperforming FedAvg (273,348.52 J), FedRand (41,486.30 J), and FLIS (50,673.33 J), while FedSpeed never reaches the threshold within 400 rounds (marked "-"). In the non-uniform scenario, FedSAGE requires only 9484.04 J, again the best result; the other methods consume between roughly 1.6 times (FedSpeed) and 18 times (FedAvg) more energy.
Similar trends are observed with the BiLSTM model. For uniform data, FedSAGE consumes 6834.96 J, compared to FedAvg (33,878.16 J) and FLIS (48,610.55 J). In the non-uniform case, FedSAGE consumes only 6002.80 J, the lowest among all approaches.
For the more complex Transformer model, FedSAGE remains the most energy-efficient. Under the uniform distribution it requires 65,601.83 J, whereas FedRand consumes over 150,000 J and FedAvg, FedSpeed, and FLIS fail to reach the threshold within the 400-round maximum. In the non-uniform setting, FedSAGE achieves 80% accuracy with 53,835.56 J, while all other methods require significantly more energy (FedAvg, for example, exceeds 880,000 J) and FLIS again fails to converge.
By achieving both high accuracy and significant energy savings, FedSAGE proves to be a practical solution for deployment in real-world, resource-limited environments such as edge computing and IoT systems.

5.5. Discussion and Analysis

The design and implementation of FedSAGE directly address the four core research questions proposed in Section 2.4:
  • (Q1) Distributed DGA detection with privacy: A pre-trained VAE deployed on each client maps raw domain data into latent vectors, so no raw data ever leaves the client. These vectors support downstream clustering and anomaly detection, preserving privacy while enabling detection across decentralized IDS nodes (see Section 3.1, Section 3.2 and Section 4.1).
  • (Q2) Handling non-IID and rare DGA data: Affinity propagation clustering on the VAE-derived latent vectors groups clients with similar data characteristics, ensuring semantic diversity is preserved in training (Section 3.2; a minimal clustering sketch follows this list). Moreover, the experiments in Section 5.1 demonstrate that the clustering aligns with true DGA zones, even under non-uniform data.
  • (Q3) Intelligent client selection under system heterogeneity: FedSAGE selects the top-ρ% fastest clients in each cluster (Section 3.3), combining data-aware and computation-aware strategies. This ensures participation from diverse data regions without incurring delays from stragglers. The result is a balance between efficiency and representativeness (see Algorithm 1, Table 4 and Table 5).
  • (Q4) Unsupervised novelty detection of DGA: The VAE, trained only on benign domains, enables unsupervised anomaly detection in the latent space by identifying DGA domains that deviate from benign clusters (Section 3.1 and Section 3.2). This eliminates the need for labeled DGA data, as shown in the clustering and accuracy results (Figure 6, Table 4 and Table 5).
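As referenced under Q2 above, the server-side grouping reduces to off-the-shelf affinity propagation over the uploaded mean latent vectors, which conveniently requires no preset number of clusters. A minimal sketch using scikit-learn, with illustrative names (mu_bars standing for the collected μ̄_i):

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    def cluster_clients(mu_bars):
        # mu_bars: dict mapping client_id -> 1-D numpy array (uploaded mu_bar).
        ids = list(mu_bars)
        X = np.stack([mu_bars[cid] for cid in ids])
        labels = AffinityPropagation(random_state=0).fit_predict(X)
        clusters = {}
        for cid, lab in zip(ids, labels):
            clusters.setdefault(int(lab), []).append(cid)
        return clusters  # cluster label -> member client IDs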
Beyond answering the research questions, FedSAGE demonstrates:
  • Robust convergence with low communication cost (see Section 5.1 and Section 5.2),
  • Strong and lightweight adaptability to real-world non-IID data (see Section 5.3),
  • Superior energy efficiency under strict system constraints (see Section 5.4).
While our current testbed supports up to 30 clients, FedSAGE is inherently scalable. Each client transmits only a compact latent vector, and clustering is performed over these low-dimensional embeddings. This design significantly reduces computational overhead, enabling the system to accommodate a large number of clients with minimal additional cost. Moreover, since each client trains independently on its local data, the overall system scales efficiently. As long as a distributed orchestration framework is employed, the number of clients can be increased without increasing per-client computational load or imposing a significant burden on the server. These properties make FedSAGE especially suitable for deployment in edge-based IDS systems and potentially generalizable to other anomaly detection problems.
However, despite its scalable design, we acknowledge several practical challenges that may arise in large-scale FL deployments. First, communication delays can become non-negligible when orchestrating updates from a large and geographically distributed client pool, especially under constrained or unstable network conditions. Although FedSAGE is a lightweight method that transmits only latent vectors, resulting in significantly lower communication cost compared to standard FL approaches, it can be further scaled to larger systems by incorporating transmission compression or asynchronous updates. Second, synchronization bottlenecks can occur when slow or unreliable clients delay the aggregation phase. FedSAGE partially addresses this through resource-aware client selection, but a more thorough analysis of gradient optimization strategies or adaptive waiting time allocation could further improve robustness. Third, the current client clustering relies on affinity propagation, which is favored thanks to its high clustering quality. However, it faces scalability challenges as the number of clients increases. To maintain efficiency, the clustering algorithm may need to trade off a small amount of accuracy for lower computational complexity, such as by using approximate clustering or hierarchical clustering methods. Recognizing and mitigating these system-level limitations will be essential for extending FedSAGE to production-scale deployments.

6. Conclusions and Future Works

In this paper, we introduced FedSAGE, a novel federated learning framework that enhances intrusion detection systems through variational autoencoder-based client clustering and enables the detection of novel DGA families without requiring labeled data from clients. Our approach addresses two critical challenges in federated learning: non-IID data distributions and system heterogeneity. By exploiting the latent-space representations produced by the VAE, FedSAGE effectively clusters clients, facilitating the robust detection of novel DGA attacks.
Extensive experiments conducted on realistic multi-zone DGA datasets demonstrate that FedSAGE consistently outperforms traditional FL methods in terms of detection accuracy, clustering quality, convergence speed, and energy consumption. Additionally, its small memory footprint and short execution time make the framework especially well-suited for deployment in resource-constrained environments such as edge computing and IoT networks. Notably, the VAE used for client-side latent extraction achieves sub-second inference time even on 2-core CPUs, ensuring that feature extraction can be performed in real time without disrupting device responsiveness. This makes FedSAGE particularly suitable for time-sensitive federated security tasks such as malware detection, intrusion prevention, and traffic anomaly identification in edge environments.
While the results are promising, several limitations remain. First, our experimental testbed is constrained to a limited-scale emulation environment due to laboratory hardware availability. Scalability to thousands of edge devices and cross-site evaluation on real-world infrastructure would further validate the applicability of FedSAGE in practical deployments. Second, our current pipeline assumes honest clients and does not explicitly address invalid or adversarial inputs, whether in the shared latent vectors μ̄_i, the VAE model, or the trained classification model. In future versions, we aim to incorporate robust latent filtering mechanisms, such as certified embedding validators or outlier detection layers, to ensure the trustworthiness of the shared representations.
FedSAGE offers a practical and scalable solution extendable to various anomaly detection tasks beyond DGA threats; for example, the framework can be extended to botnet and C2 traffic detection, IoT behavior profiling, federated fraud detection, and federated medical diagnostics. In future research, we will focus on optimizing client selection strategies by investigating suitable selection ratios under diverse operational scenarios, such as varying client computational capacity or energy budgets; this direction is motivated by the observed sensitivity of FL performance to the selection ratio, underscoring the efficiency and accuracy gains that targeted optimization could deliver. In addition, we will develop FL mechanisms that harden the latent vectors and model parameters against adversarial attacks.

Author Contributions

Conceptualization, V.H. and N.H.T.; methodology, M.V.D.; software, P.M.D., T.T.P. and T.D.T.; validation, M.V.D. and T.D.T.; formal analysis, M.V.D. and N.H.T.; investigation, M.V.D. and T.D.T.; resources, M.V.D.; data curation, M.V.D.; writing—original draft preparation, M.V.D., P.M.D. and T.T.P.; writing—review and editing, M.V.D., V.H. and N.H.T.; visualization, M.V.D., P.M.D. and T.T.P.; supervision, N.H.T. and V.H.; project administration, M.V.D.; funding acquisition, N.H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Hanoi University of Science and Technology (HUST) under project no. T2023-PC-038.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liao, H.J.; Lin, C.H.R.; Lin, Y.C.; Tung, K.Y. Intrusion detection system: A comprehensive review. J. Netw. Comput. Appl. 2013, 36, 16–24.
  2. Sood, A.K.; Zeadally, S. A taxonomy of domain-generation algorithms. IEEE Secur. Priv. 2016, 14, 46–53.
  3. Fu, Y.; Yu, L.; Hambolu, O.; Ozcelik, I.; Husain, B.; Sun, J.; Sapra, K.; Du, D.; Beasley, C.T.; Brooks, R.R. Stealthy domain generation algorithms. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1430–1443.
  4. Minh, N.N.; Hieu, P.T.; Hai, V.; Thanh, N.H. DGA-based Intrusion Detection System using Federated Learning Method on Edge Devices. In Proceedings of the 2024 International Conference on Information Networking (ICOIN), Ho Chi Minh City, Vietnam, 17–19 January 2024; pp. 509–514.
  5. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  6. Aouedi, O.; Piamrat, K.; Muller, G.; Singh, K. FLUIDS: Federated Learning with semi-supervised approach for Intrusion Detection System. In Proceedings of the 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 8–11 January 2022; pp. 523–524.
  7. Duc, M.V.; Luan, N.T.; Tai, N.T.; Hieu, N.P.T.; Minh, N.N.; Hieu, P.T.; Hai, V.; Thanh, N.H. On the Impact of Heterogeneity on Federated Learning at the Edge with DGA Malware Detection. In Proceedings of the Asian Internet Engineering Conference 2024, Sydney, Australia, 9 August 2024; pp. 10–17.
  8. Tian, P.; Liao, W.; Yu, W.; Blasch, E. WSCC: A weight-similarity-based client clustering approach for non-IID federated learning. IEEE Internet Things J. 2022, 9, 20243–20256.
  9. Bo, L.; Ping, Z.Y.; Cai, L.Q. FedPVD: Clustered Federated Learning with Non-IID Data. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; pp. 551–556.
  10. Shih, C.H.; Kuo, J.J.; Sheu, J.P. Information-Exchangeable Hierarchical Clustering for Federated Learning with Non-IID Data. In Proceedings of the GLOBECOM 2023—2023 IEEE Global Communications Conference, Kuala Lumpur, Malaysia, 4–8 December 2023; pp. 231–236.
  11. Morafah, M.; Vahidian, S.; Wang, W.; Lin, B. FLIS: Clustered federated learning via inference similarity for non-IID data distribution. IEEE Open J. Comput. Soc. 2023, 4, 109–120.
  12. Li, Q.; Shao, S.; Yang, C.; Chen, J.; Qi, F.; Guo, S. Communication-efficient Federated Learning Framework with Parameter-Ordered Dropout. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; pp. 1195–1200.
  13. Alsamiri, J.; Alsubhi, K. Federated learning for intrusion detection systems in internet of vehicles: A general taxonomy, applications, and future directions. Future Internet 2023, 15, 403.
  14. Bharati, S.; Mondal, M.R.H.; Podder, P.; Prasath, V.S. Federated learning: Applications, challenges and future directions. Int. J. Hybrid Intell. Syst. 2022, 18, 19–35.
  15. Kim, H.; Kim, B.; Kim, Y.; You, C.; Park, H. K-FL: Kalman filter-based clustering federated learning method. IEEE Access 2023, 11, 36097–36105.
  16. Xiao, P.; Cheng, S.; Stankovic, V.; Vukobratovic, D. Averaging is probably not the optimum way of aggregating parameters in federated learning. Entropy 2020, 22, 314.
  17. Liu, H.; Lang, B. Machine learning and deep learning methods for intrusion detection systems: A survey. Appl. Sci. 2019, 9, 4396.
  18. Onietan, C.I.O.; Martins, I.; Owoseni, T.; Omonedo, E.C.; Eze, C.P. A preliminary study on the application of hybrid machine learning techniques in network intrusion detection systems. In Proceedings of the 2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG), Omu-Aran, Nigeria, 5–7 April 2023; Volume 1, pp. 1–7.
  19. Haripriya, L.; Jabbar, M.A. Role of machine learning in intrusion detection system. In Proceedings of the 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 29–31 March 2018; pp. 925–929.
  20. Lee, S.W.; Sidqi, H.M.; Mohammadi, M.; Rashidi, S.; Rahmani, A.M.; Masdari, M.; Hosseinzadeh, M. Towards secure intrusion detection systems using deep learning techniques: Comprehensive analysis and review. J. Netw. Comput. Appl. 2021, 187, 103111.
  21. Agrawal, S.; Sarkar, S.; Aouedi, O.; Yenduri, G.; Piamrat, K.; Alazab, M.; Bhattacharya, S.; Maddikunta, P.K.R.; Gadekallu, T.R. Federated learning for intrusion detection system: Concepts, challenges and future directions. Comput. Commun. 2022, 195, 346–361.
  22. Khraisat, A.; Alazab, A.; Singh, S.; Jan, T.; Gomez, A., Jr. Survey on federated learning for intrusion detection system: Concept, architectures, aggregation strategies, challenges, and future directions. ACM Comput. Surv. 2024, 57, 1–38.
  23. Shahzad, H.; Sattar, A.R.; Skandaraniyam, J. DGA domain detection using deep learning. In Proceedings of the 2021 IEEE 5th International Conference on Cryptography, Security and Privacy (CSP), Zhuhai, China, 8–10 January 2021; pp. 139–143.
  24. Kumar, S.; Bhatia, A. Detecting domain generation algorithms to prevent DDoS attacks using deep learning. In Proceedings of the 2019 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), Goa, India, 16–19 December 2019; pp. 1–4.
  25. Zhang, W.W.; Gong, J.; Liu, Q. Detecting machine generated domain names based on morpheme features. In Proceedings of the 1st International Workshop on Cloud Computing and Information Security, Shanghai, China, 9–11 November 2013; Atlantis Press: Dordrecht, The Netherlands, 2013; pp. 408–411.
  26. Yadav, S.; Reddy, A.K.K.; Reddy, A.N.; Ranjan, S. Detecting algorithmically generated malicious domain names. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, Melbourne, Australia, 1–3 November 2010; pp. 48–61.
  27. Drichel, A.; Meyer, U.; Schüppen, S.; Teubert, D. Analyzing the real-world applicability of DGA classifiers. In Proceedings of the 15th International Conference on Availability, Reliability and Security, Online, 25–28 August 2020; pp. 1–11.
  28. Yu, B.; Pan, J.; Hu, J.; Nascimento, A.; De Cock, M. Character level based detection of DGA domain names. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
  29. Catania, C.; García, S.; Torres, P. Deep convolutional neural networks for DGA detection. In Proceedings of the Argentine Congress of Computer Science, Buenos Aires, Argentina, 1–5 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 327–340.
  30. Tran, D.; Mac, H.; Tong, V.; Tran, H.A.; Nguyen, L.G. A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing 2018, 275, 2401–2413.
  31. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
  32. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
  33. Yang, L.; Fan, W.; Bouguila, N. Deep clustering analysis via dual variational autoencoder with spherical latent embeddings. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 6303–6312.
  34. Ma, H. Achieving deep clustering through the use of variational autoencoders and similarity-based loss. Math. Biosci. Eng. 2022, 19, 10344–10360.
  35. Frey, B.J.; Dueck, D. Clustering by passing messages between data points. Science 2007, 315, 972–976.
  36. Triastcyn, A.; Faltings, B. Federated learning with Bayesian differential privacy. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 2587–2596.
  37. Ding, L.; Li, L.; Han, J.; Fan, Y.; Hu, D. Detecting Domain Generation Algorithms with Bi-LSTM. Comput. Mater. Contin. 2019, 61.
  38. Gogoi, B.; Ahmed, T. DGA domain detection using pretrained character based transformer models. In Proceedings of the 2023 IEEE Guwahati Subsection Conference (GCON), Guwahati, India, 23–25 June 2023; pp. 1–6.
  39. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186.
Figure 1. System architecture of the proposed Federated Learning-based Intrusion Detection System (IDS/IPS). Multiple IDS nodes (e.g., routers, edge devices) independently monitor local network traffic and train local detection models using their private domain data (e.g., malicious domains, benign domains). A central server coordinates the training process using the federated learning paradigm, where only model updates are exchanged to preserve data privacy. Aggregating updates from selected IDS nodes iteratively improves the global model.
Figure 2. The Variational Autoencoder (VAE) model architecture with encoder, decoder, and latent space with reparameterization.
Figure 3. Clustering flow on each client using the latent vector of the VAE.
Figure 4. Overview of the FedSAGE framework for federated client clustering and selection. Blue arrows represent the flow between steps, while green arrows indicate the communication flow between devices. Each client i uses the VAE to compute and infer the latent vectors of its private data, then sends the latent mean vector (μ̄_i) to the server. The server performs latent-space clustering to group clients with similar data distributions. Within each cluster, a resource-aware client selection strategy chooses the most efficient clients for training. Selected clients then participate in model training and send their updates to the server for aggregation. This process repeats in each training round.
Figure 5. Testbed architecture of the FedSAGE framework. Clients encode local data via the VAE, send latent vectors to the server for clustering and selection, and participate in federated training based on the selection criteria.
Figure 6. PCA visualization of client clustering results using latent vectors from the VAE model, based on the affinity propagation algorithm. Each point in the visualization space represents a client.
Figure 7. Client-side execution time (in seconds) for local training and VAE inference under heterogeneous CPU core configurations.
Table 2. The DGA families that appear in each zone.
Zone 1: banjori, corebot, necurs, ramnit, reconyc
Zone 2: bumblebee, orchard, qadars, gozi, vawtrak
Zone 3: bazarbackdoor, fobber, ramdo, padcrypt, zloader
Zone 4: locky, newgoz, ngioweb, dnschanger, pykspa
Table 3. Summary of experimental setup.
Component             Description
Client Devices        30 Virtual Machines (VMs), Ubuntu 20.04, Intel Xeon Silver 4210 CPUs
System Heterogeneity  Random CPU cores per VM: 1 to 8 cores
Communication Layer   RabbitMQ over LAN (1 Gbps bandwidth)
Training Settings     SGD (lr = 1 × 10⁻⁵, momentum = 0.9), batch size = 32, up to 400 rounds
FL Framework          Custom Python (3.8.10) + PyTorch (2.2.2)
Selection Ratio       ρ = 30% (top fastest clients per cluster in FedSAGE and variants)
Data Partitioning     Uniform: 5 K samples/label/client; Non-uniform: 2.5 K–7.5 K samples/label/client
Table 4. Top-1 test accuracy (%) for uniformly distributed client data.
DNN Model    Training Time (s)   FedAvg   FedSAGE   FedRand   FedSpeed   FLIS
CNN          400                 74.08    78.36     72.20     62.63      49.66
CNN          600                 74.08    84.65     72.66     63.58      72.14
CNN          800                 76.69    87.53     77.19     65.26      72.82
BiLSTM       200                 79.64    80.79     77.57     75.79      60.52
BiLSTM       350                 80.71    86.96     80.13     81.81      70.61
BiLSTM       500                 82.10    89.33     81.93     84.16      73.85
Transformer  500                 63.02    65.98     61.52     64.30      53.37
Transformer  1000                63.70    64.37     70.55     69.71      54.42
Transformer  1500                64.86    76.09     74.80     65.64      55.76
Table 5. Top-1 test accuracy (%) for non-uniformly distributed client data.
DNN Model    Training Time (s)   FedAvg   FedSAGE   FedRand   FedSpeed   FLIS
CNN          400                 70.60    82.47     70.52     78.78      50.34
CNN          600                 75.28    84.11     70.52     81.38      59.22
CNN          800                 77.85    87.85     78.48     85.66      66.24
BiLSTM       200                 78.29    81.86     75.10     78.96      49.66
BiLSTM       350                 79.14    87.62     77.89     82.48      49.66
BiLSTM       500                 80.27    89.83     80.26     85.38      49.66
Transformer  500                 65.12    66.96     62.64     63.65      53.44
Transformer  1000                66.49    73.56     67.42     68.16      54.45
Transformer  1500                67.63    80.32     69.05     72.64      54.96
Table 6. Average memory usage (in GB) for different DNN models.
Model            CNN    BiLSTM   Transformer   VAE
RAM Usage (GB)   0.62   0.63     0.68          0.10
Table 7. Energy consumption (Joules) required by each FL algorithm to achieve the accuracy threshold of 80%.
DNN Model    Data Distribution   FedAvg       FedSAGE     FedRand      FedSpeed    FLIS
CNN          Uniform             273,348.52   16,812.44   41,486.30    -           50,673.33
CNN          Non-uniform         175,649.49   9484.04     32,677.66    15,375.52   58,383.31
BiLSTM       Uniform             33,878.16    6834.96     12,500.02    10,503.85   48,610.55
BiLSTM       Non-uniform         54,085.36    6002.80     16,979.58    8923.32     -
Transformer  Uniform             -            65,601.83   153,981.76   -           -
Transformer  Non-uniform         884,910.56   53,835.56   185,082.63   92,980.70   -