Article

Hybrid GNN–LSTM Architecture for Probabilistic IoT Botnet Detection with Calibrated Risk Assessment

by Tetiana Babenko 1, Kateryna Kolesnikova 2, Yelena Bakhtiyarova 3,*, Damelya Yeskendirova 1,*, Kanibek Sansyzbay 3, Askar Sysoyev 1 and Oleksandr Kruchinin 4

1 Department of Cybersecurity, International Information Technologies University, Almaty 050040, Kazakhstan
2 Department of Information Systems, International Information Technologies University, Almaty 050040, Kazakhstan
3 Department of Radio Engineering, Electronics and Telecommunications, International Information Technologies University, Almaty 050040, Kazakhstan
4 Department of Information Security and Telecommunications, Dnipro University of Technology, 49005 Dnipro, Ukraine
* Authors to whom correspondence should be addressed.
Computers 2026, 15(1), 26; https://doi.org/10.3390/computers15010026
Submission received: 9 December 2025 / Revised: 27 December 2025 / Accepted: 29 December 2025 / Published: 5 January 2026
(This article belongs to the Section ICT Infrastructures for Cybersecurity)

Abstract

Detecting botnets in IoT environments is difficult because most intrusion detection systems treat network events as independent observations. In practice, infections spread through device relationships and evolve through distinct temporal phases. A system that ignores either aspect will miss important patterns. This paper explores a hybrid architecture combining Graph Neural Networks with Long Short-Term Memory networks to capture both structural and temporal dynamics. The GNN component models behavioral similarity between traffic flows in feature space, while the LSTM tracks how patterns change as attacks progress. The two components are trained jointly so that relational context is preserved during temporal learning. We evaluated the approach on two datasets with different characteristics. N-BaIoT contains traffic from nine devices infected with Mirai and BASHLITE, while CICIoT2023 covers 105 devices across 33 attack types. On N-BaIoT, the model achieved 99.88% accuracy with F1 of 0.9988 and Brier score of 0.0015. Cross-validation on CICIoT2023 yielded 99.73% accuracy with Brier score of 0.0030. The low Brier scores suggest that probability outputs are reasonably well calibrated for risk-based decision making. Consistent performance across both datasets provides some evidence that the architecture generalizes beyond a single benchmark setting.

1. Introduction

The Internet of Things has grown faster than anyone really expected. By 2023 there were already over 15 billion connected devices worldwide, and projections put that number somewhere beyond 30 billion by the end of this decade [1]. That kind of growth sounds impressive until you start thinking about security. Most of these devices were designed with functionality as the priority, not protection. Manufacturers face real pressure to keep costs down and get products to market quickly, which means security features often get treated as afterthoughts [2,3]. The result is an enormous attack surface, and attackers have noticed. Botnets like Mirai showed just how effectively compromised IoT devices could be weaponized for distributed denial of service attacks, cryptomining, and worse [4].
What makes securing IoT networks genuinely difficult is that the problem differs from traditional network security in some fundamental ways. The sheer number and variety of devices makes centralized monitoring impractical. Many of these devices run unattended for months or years, so infections can persist without anyone noticing. But there is something else that matters even more. Threats in IoT environments do not just appear at a single point. They spread laterally through network connections while simultaneously evolving through distinct temporal phases [5,6]. An attacker establishes an initial foothold, expands across connected devices, and escalates capabilities over time. Any detection approach that ignores either the spatial dimension or the temporal one is going to miss part of the picture.
Traditional intrusion detection methods struggle with these realities. Rule-based systems like Snort [7] need constant updates and generate too many false positives when confronted with the heterogeneous traffic patterns typical of IoT deployments. Statistical anomaly detection runs into trouble with high-dimensional data and has difficulty distinguishing genuine threats from normal behavioral variation [8]. Machine learning approaches offer more flexibility, but most treat network events as independent observations. They ignore the contextual information embedded in how devices connect to each other and how traffic patterns shift over time [9,10]. There is a gap here between what these traditional approaches can do and what IoT security actually requires.
Graph neural networks offer one possible way forward. Unlike conventional neural networks that process features in isolation, GNNs explicitly model relationships between entities [11,12]. For network security, this means they can learn from the topology itself rather than just from individual traffic features. Zhang et al. [13] demonstrated this by applying graph embedding techniques to botnet detection and showing that topological features could identify infected nodes better than traditional feature engineering. Bibi et al. [14] built a GraphSAGE-based framework that reached 89.3% accuracy on the Kitsune dataset. Their work made clear that GNNs can capture structural patterns that other methods overlook. The limitation, as they acknowledged, is that purely structural analysis misses attacks that manifest primarily through temporal changes rather than topological anomalies.
In this work, these relationships are modeled at the level of traffic flows rather than physical devices, which allows structural patterns to be inferred even when explicit communication links are absent.
Other researchers have pushed the graph-based approach further. Saad [15] tested various GNN architectures on the ISCX-Bot-2014 dataset and achieved over 94% accuracy while also demonstrating some ability to generalize to previously unseen botnet variants. Zhou and Xu [16] tackled the scalability problem by combining graph isomorphism networks with Graph SAINT-based subgraph sampling, which let them handle larger datasets without sacrificing too much accuracy. But the fundamental limitation persists. Network structure alone does not capture attacks that unfold gradually through behavioral changes rather than sudden topological shifts.
Long short-term memory networks approach the problem from the opposite direction. Hochreiter and Schmidhuber [17] designed LSTMs to handle sequential data by selectively retaining and forgetting information across time steps. This makes them well suited for detecting attack patterns that evolve temporally. Kim et al. [18] built an LSTM-based intrusion detection system that performed well on the NSL-KDD dataset by learning temporal dependencies in network traffic. The trouble is that their approach, like most LSTM applications in this domain, treats each traffic flow independently without considering how devices relate to each other topologically.
Recent work has tried various ways to enhance LSTM architectures for IoT security. Alkahtani and Aldhyani [19] combined CNNs with LSTMs to detect Mirai and BASHLITE attacks, reaching 90.88% accuracy on the N-BaIoT dataset. The CNN component extracted spatial features while the LSTM handled temporal patterns. Sinha et al. [20] proposed a similar hybrid CNN-LSTM architecture with attention mechanisms and achieved even higher accuracy. Sayegh et al. [21] addressed the class imbalance problem that plagues intrusion detection by integrating LSTM with synthetic oversampling techniques. These temporal models have shown strong results, but they still treat network topology as something implicit in the traffic features rather than modeling device relationships explicitly.
The real challenge becomes clear when you consider how actual botnet attacks work. They start with an initial compromise, which depends on network accessibility and device vulnerabilities. Then they spread through network connections in a process fundamentally shaped by topology. But the behavioral manifestation evolves temporally as infected devices progress through reconnaissance, exploitation, and command-and-control phases [22,23]. These temporal transitions carry diagnostic information that pure topological analysis cannot capture. Neither GNNs alone nor LSTMs alone see the complete picture.
A few research groups have started exploring hybrid approaches. Vitulyova et al. [24] developed a GNN-LSTM architecture for reconstructing attack vectors and demonstrated that the combination outperformed single-method approaches. Their focus was on forensic analysis after incidents rather than real-time detection, but they validated the core idea that integrating spatial and temporal information improves results. Friji et al. [25] achieved very high accuracy on multi-stage attack detection using a phased pipeline that analyzed topological and temporal features separately before combining them. The drawback of their approach is that the components do not inform each other during training, which potentially misses opportunities for richer feature interactions.
The N-BaIoT dataset has become something of a standard benchmark for this kind of research. Meidan et al. [26] created it by actually infecting nine commercial IoT devices with Mirai and BASHLITE botnets and capturing the resulting traffic. The dataset includes over 7 million instances with 115 statistical features covering packet sizes, timing, and bandwidth metrics [27,28]. It reflects realistic IoT deployment characteristics better than synthetic alternatives. Several groups have applied machine learning to N-BaIoT with varying success. Meidan et al. [26] used deep autoencoders and achieved strong results, though their approach required training separate models for each device type. Shorman et al. [29] combined grey wolf optimization with one-class SVM but at significant computational cost. Kasongo and Sun [30] explored ensemble methods to improve robustness. What these approaches share, for the most part, is a focus on classification accuracy rather than probability calibration.
That focus on accuracy misses something important for operational deployment. Security operations centers deal with thousands of alerts every day. They need to prioritize, and prioritization requires knowing not just whether something looks malicious but how confident the system is in that assessment. Binary classification forces analysts to treat every positive detection the same way regardless of the model’s certainty [31]. A system that outputs calibrated probability scores would let them distinguish between detections that demand immediate response and those that warrant investigation but not emergency procedures. The Brier score provides a way to measure whether probability estimates actually reflect true likelihoods rather than just appearing confident [32].
Babenko et al. [33] made a related point in their work on OSINT-driven cyber risk assessment. Effective security systems need to synthesize diverse information types rather than relying on a single data source. The same principle applies here. A detection architecture that can process both topological relationships and temporal patterns positions itself to incorporate additional data sources as they become available.
This paper proposes a hybrid architecture that tries to address these gaps by looking at structure and time together instead of treating them as separate problems. The GNN component is used to capture how individual flows relate to one another in a structural sense, while the LSTM follows how traffic behavior changes as events unfold. The two components are trained jointly, so information about relationships is not lost before temporal patterns are learned, which we found important in practice.
The approach is evaluated on the N-BaIoT and CICIoT2023 datasets, and the analysis goes beyond classification accuracy to also consider probability calibration through the Brier score. On N-BaIoT, the model reaches 99.88% accuracy with an F1 score of 0.9988 and a Brier score of 0.0015. On CICIoT2023, the accuracy is 99.73% with a Brier score of 0.0030. These results suggest that the probability estimates produced by the model are reasonably well calibrated and therefore more suitable for risk-based decision making than uncalibrated scores. Across both datasets, the hybrid model shows consistently stronger performance than standalone GNN and standalone LSTM baselines. This does not imply that either component is insufficient on its own, but rather that combining relational and temporal modeling appears to offer complementary benefits in this setting.
To make the contribution of this study easier to interpret, it may help to state more directly what distinguishes the proposed approach from existing work. One aspect lies in how interactions are represented in the first place. Rather than assuming that meaningful relationships must follow explicit communication paths, the model builds a dynamic graph from similarities observed in feature space. This reflects the observation that coordinated malicious behavior often appears as correlated activity patterns, even in the absence of direct communication. From this representation follows the architectural design. The graph neural component is not treated as a separate preprocessing stage whose output is later fused with a temporal model. Instead, graph embeddings are passed directly into the recurrent structure, which allows spatial context to persist as temporal dependencies are learned. This coupling was motivated by the concern that separating these stages too rigidly can obscure interactions that unfold over time. Another important element of the work concerns how model outputs are interpreted. Raw neural scores are transformed through a calibration process that combines temperature scaling with Platt scaling, with the explicit goal of producing probabilities that can be meaningfully compared and acted upon. This emphasis on calibration shifts the focus away from binary detection toward probabilistic reasoning. The evaluation strategy was also chosen with this perspective in mind. Experiments were conducted on two benchmark datasets that differ substantially in their composition, collection conditions, and temporal context, spanning a five-year gap during which attack methodologies evolved considerably. This made it possible to observe how the learned representations behave outside the narrow assumptions of a single environment.
Finally, the calibrated outputs are integrated into a contextual risk scoring formulation that accounts for operational factors such as device importance and network position. This step was introduced to bridge the gap between detection performance and response prioritization, which is often where purely accuracy-driven approaches fall short.
The remainder of the paper is organized as follows. Section 2 describes the methodology including dataset preparation, graph construction, model architecture, and training procedures. Section 3 presents experimental results with analysis across multiple performance metrics. Section 4 discusses implications, limitations, and directions for future research. Section 5 concludes with a summary of contributions.

2. Materials and Methods

2.1. Overall Framework

The architecture we developed processes raw network traffic through three interconnected stages that ultimately produce calibrated risk assessments. Figure 1 illustrates this pipeline. The first stage handles graph construction, where temporal snapshots of device interactions are transformed into structured graph representations. The idea here builds on earlier work by Kipf and Welling [11], though we adapt it for the IoT security domain. Each network flow becomes a node in the graph, and edges are constructed based on behavioral similarity in feature space rather than direct communication links. The resulting structure encodes spatial relationships that conventional intrusion detection systems tend to overlook [7]. A more detailed explanation of the graph construction process is provided in Section 2.3.
The second stage is where temporal learning happens. Sequences of graph embeddings capture how the network topology evolves as an attack unfolds. Traditional approaches often process each timestamp in isolation [8], which means they lose important information about how attacks progress over time. Our methodology addresses this by using LSTM networks [17] that maintain memory across time steps. The approach resembles what Kim et al. [18] proposed for traffic classification, though we extend it to work with graph-structured data rather than raw feature vectors. The graph construction at each time step can be expressed as:
$$G_t = \mathrm{GraphConstruct}(X_t, \theta_g)$$

where $X_t$ represents the feature matrix at time $t$ and $\theta_g$ denotes the learnable parameters for graph topology inference [12].
The third stage transforms these spatiotemporal representations into continuous risk scores. Rather than producing binary classifications that label traffic as simply benign or malicious, our approach generates probability distributions that quantify threat severity [31]. Security administrators can then prioritize their responses based on risk magnitude instead of treating every alert with equal urgency [34].
One aspect of the framework worth noting is how information flows between stages. The graph construction module adapts dynamically to topology changes, which matters in real deployments where IoT devices frequently join or leave networks [6]. The temporal component maintains a sliding window of historical states, and through experimentation we found that this windowed design balances computational efficiency against sufficient context reasonably well for detecting multi-stage attacks [23,25].

2.2. Dataset Description

To evaluate the proposed architecture under varying conditions, we selected two publicly available benchmark datasets that differ substantially in their characteristics. The N-BaIoT dataset represents an earlier generation of IoT botnet research with rich statistical features extracted from a small number of devices [26,27], while CICIoT2023 reflects more recent attack methodologies across a larger and more diverse device population [30]. Using both allows us to assess whether the model captures generalizable patterns or merely overfits to the peculiarities of a single collection environment [10,30]. This dual-dataset validation strategy addresses a common criticism of IoT security research, where models often demonstrate strong performance on one benchmark but fail to transfer to different network conditions [9,14].

2.2.1. N-BaIoT Dataset

We used the N-BaIoT dataset for primary training and evaluation. Meidan et al. [26] created it by deploying nine commercial IoT devices in an isolated network environment. These devices included doorbell cameras, baby monitors, security cameras, webcams, a thermostat, a smart socket, and a motion sensor. The researchers then infected the devices with actual Mirai and BASHLITE botnet variants and captured network traffic throughout the infection lifecycle [4,5]. This methodology has been widely adopted in subsequent IoT security research [27,28,29] because it produces authentic behavioral patterns rather than synthetic approximations that might not reflect real-world attack dynamics [22].
What makes this dataset particularly valuable is the diversity of device behaviors it captures [6]. The Danmini doorbell generates event-driven traffic bursts when someone presses the button, while the Philips baby monitor produces continuous video streams with low latency requirements. The Provision security cameras send periodic heartbeat messages and spike in activity when motion is detected [26]. This heterogeneity creates a challenging classification problem where normal behavior varies substantially across device types, and the model must learn to distinguish device-specific anomalies from cross-device attack signatures [13,14]. Table 1 summarizes the traffic characteristics for each device along with the sample counts for benign and attack traffic.
The dataset provides 115 statistical features extracted from network flows using the Argus tool, covering packet size distributions, inter-arrival times, and bandwidth consumption metrics across multiple time windows [26,28]. These features were computed over five temporal aggregations (100 ms, 500 ms, 1.5 s, 10 s, and 1 min), capturing both short-term bursts and longer-term behavioral patterns [27]. The feature engineering approach follows established practices in network traffic analysis [8,9], though it requires substantial computational resources for real-time extraction in production deployments [30].
Figure 2 illustrates the class distribution across devices in the N-BaIoT dataset. The severe imbalance between benign and attack samples, with attack traffic outnumbering benign by ratios ranging from 7:1 to nearly 100:1 depending on the device, presents a significant challenge for classifier training [21,29]. We addressed this through balanced sampling rather than synthetic oversampling techniques like SMOTE, which can introduce artifacts that do not represent genuine attack behavior [21].
For our experiments, we sampled 500,000 instances from each class to create a balanced dataset of one million network flows [29,30]. The data was partitioned into training, validation, and test sets following a 70/15/15 ratio with stratification to preserve class distributions across device types, resulting in 700,000 training samples, 150,000 validation samples, and 150,000 test samples [9]. The stratified sampling ensures that each device type is represented proportionally in all splits, preventing the model from overfitting to majority device categories [10].
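As an illustration of this partitioning step, the sketch below shows one way to realize a 70/15/15 stratified split with scikit-learn. The variable names (`X`, `y`, `device`) and the combined label/device stratification key are assumptions for clarity, not the code used in our experiments.

```python
from sklearn.model_selection import train_test_split

# Assumed inputs: X (feature matrix), y (binary labels), device (device-type
# ids). Stratifying on the joint (label, device) key preserves both the
# class balance and the per-device proportions in every split.
strata = [f"{lbl}_{dev}" for lbl, dev in zip(y, device)]

# First split: 70% train, 30% held out.
X_train, X_rest, y_train, y_rest, s_train, s_rest = train_test_split(
    X, y, strata, test_size=0.30, stratify=strata, random_state=42)

# Second split: divide the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=s_rest, random_state=42)
```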
The N-BaIoT dataset is publicly available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT, accessed on 28 December 2025).

2.2.2. CICIoT2023 Dataset

The CICIoT2023 dataset was developed by the Canadian Institute for Cybersecurity to address limitations in earlier IoT security benchmarks [35]. Where N-BaIoT focused on depth with rich feature extraction from a handful of devices, CICIoT2023 prioritizes breadth by capturing traffic from 105 IoT devices spanning diverse categories including smart home appliances, healthcare monitors, industrial sensors, and wearable technology [35,36]. The attack traffic encompasses 33 distinct attack types organized into seven categories: DDoS, DoS, reconnaissance, web-based attacks, brute force, spoofing, and Mirai variants [35].
The dataset was collected in 2023 using contemporary attack tools and methodologies, which matters because attack techniques evolve rapidly [4,5,22]. Botnets from 2018 behave differently than those compiled with modern evasion capabilities, and a detection system trained exclusively on older attack patterns may struggle with current threats [22,23]. The temporal gap of five years between N-BaIoT (2018) [26] and CICIoT2023 (2023) [35] provides an opportunity to assess whether the proposed architecture captures fundamental attack characteristics that persist across botnet generations or relies on signatures specific to particular malware versions [25].
The device population in CICIoT2023 reflects the expanding IoT ecosystem more accurately than earlier benchmarks [35,36]. Beyond the consumer devices typical of N-BaIoT, the dataset includes medical IoT devices such as blood pressure monitors and glucose meters, industrial sensors for temperature and humidity monitoring, and smart building infrastructure including HVAC controllers and lighting systems [35]. This diversity introduces additional complexity because normal traffic patterns vary even more dramatically across device categories than within the consumer IoT space alone [6,14].
Table 2 summarizes the attack categories and their distribution in the dataset. The diversity of attack types is notable, ranging from volumetric DDoS variants that flood networks with traffic to more subtle reconnaissance activities that might otherwise blend with legitimate device behavior [22,23,25].
The feature space differs substantially from N-BaIoT, which provides an important test of architectural flexibility [35]. CICIoT2023 includes 46 features in its original form, though we used the 39 most discriminative features identified by Neto et al. [35] for computational efficiency. These features focus on flow-level statistics including protocol flags, packet counts, byte counts, flow duration, and inter-arrival time statistics [35,36]. While fewer in number than N-BaIoT’s 115 features, these attributes capture essential traffic characteristics without requiring the deep packet inspection that becomes problematic with encrypted traffic [3,8].
Figure 3 shows the attack category distribution in CICIoT2023. The dominance of DDoS attacks (60.7%) reflects their prevalence in real-world IoT botnets [4,5], but also creates potential for models to achieve high accuracy by simply learning DDoS signatures while neglecting less common attack types [21]. We mitigated this by using stratified sampling that ensures representation of all attack categories during training [9,30].
Figure 4 compares the feature spaces of the two datasets, highlighting both the different dimensionalities (115 vs. 39 features) and the partial overlap in feature types. Both datasets include packet size statistics, inter-arrival times, and flow duration metrics, though computed using different extraction tools and aggregation windows [26,35]. This partial overlap allows us to assess whether the GNN-LSTM architecture learns abstract traffic patterns that transfer across feature representations or becomes dependent on specific feature engineering choices [12,14].
For our experiments, we used 400,000 samples with balanced class distribution (200,000 benign, 200,000 attack). We employed stratified five-fold cross-validation rather than a single train-test split to obtain more robust performance estimates with confidence intervals [9,30]. This approach is particularly important for CICIoT2023 given its larger device population and attack variety, where a single random split might produce biased estimates depending on which attack types happen to concentrate in the test set [10,21]. Each fold used approximately 272,000 training samples, 48,000 validation samples, and 80,000 test samples.
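The cross-validation protocol can be sketched as follows; carving a further 15% validation set out of each fold's training portion reproduces the approximate 272,000/48,000/80,000 counts above. `X` and `y` are assumed NumPy arrays and are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# X, y are assumed to hold the 400,000 balanced CICIoT2023 samples.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (fit_idx, test_idx) in enumerate(skf.split(X, y)):
    # Reserve ~15% of this fold's training portion for validation, which
    # matches the approximate 272k train / 48k val / 80k test counts.
    train_idx, val_idx = train_test_split(
        fit_idx, test_size=0.15, stratify=y[fit_idx], random_state=fold)
```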
The CICIoT2023 dataset is publicly available at https://www.unb.ca/cic/datasets/iotdataset-2023.html (accessed on 28 December 2025) [35].

2.2.3. Data Preprocessing

Both datasets required preprocessing before training, though the specific steps differed given their distinct characteristics [8,9]. The preprocessing pipeline was designed to maintain consistency in how the model receives input while respecting the inherent differences between datasets [10].
For N-BaIoT, we applied z-score normalization independently to each of the 115 features after the train-test split to prevent data leakage [9]:
$$x_{\mathrm{norm}} = \frac{x - \mu_{\mathrm{train}}}{\sigma_{\mathrm{train}}}$$

where $\mu_{\mathrm{train}}$ and $\sigma_{\mathrm{train}}$ are computed exclusively from training data and then applied to validation and test sets [30]. The severe class imbalance in the raw data, with attack samples outnumbering benign by roughly 15 to 1 across all devices combined, was addressed through balanced sampling rather than oversampling techniques that might introduce synthetic artifacts [21,29].
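In practice this amounts to fitting the normalization statistics on the training partition only and reusing them downstream, as in the minimal scikit-learn sketch below (the array names are placeholder assumptions).

```python
from sklearn.preprocessing import StandardScaler

# Fit per-feature mean and standard deviation on training data only, then
# reuse those statistics for validation and test to avoid data leakage.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```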
CICIoT2023 preprocessing followed a similar normalization procedure for its 39 features [35]. We encountered missing values in approximately 0.3% of samples, concentrated in flow duration and inter-arrival time features where connection timeouts produced undefined values [36]. These were handled through median imputation within each feature dimension, computed separately for benign and attack classes to preserve distributional characteristics [8]. The attack category labels were consolidated into a binary classification scheme matching the N-BaIoT formulation, with all 33 attack types grouped as the positive class [35].
For temporal sequencing, we constructed overlapping windows of length T = 24 for both datasets, selected through experimentation that balanced computational cost against sufficient temporal context for capturing multi-stage attack patterns [19,20,24]. Each window slides by one timestamp, creating dense coverage while maintaining sequence continuity. This generates approximately n − 23 training sequences from n raw samples. The window length aligns with findings from Kim et al. [18] and Alkahtani and Aldhyani [19], who reported that windows between 20 and 30 time steps capture relevant temporal dependencies for network intrusion detection.
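The windowing step itself is straightforward; a minimal sketch is shown below, assuming the flow feature vectors are already ordered by time.

```python
import numpy as np

def sliding_windows(features: np.ndarray, T: int = 24) -> np.ndarray:
    """Build overlapping windows of length T with stride 1: n samples yield
    n - T + 1 sequences (n - 23 for T = 24, as noted above)."""
    return np.stack([features[i:i + T]
                     for i in range(len(features) - T + 1)])
```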
Table 3 summarizes the final dataset configurations used in our experiments, highlighting the key differences that make cross-dataset validation meaningful [10,14].

2.3. Graph Construction and Feature Representation

Transforming unstructured network traffic into meaningful graph representations is perhaps the most nuanced aspect of our methodology. The challenge is that relationships between devices are not always explicit. Network address translation can obscure direct communication patterns, payloads may be encrypted, and multi-hop routing complicates attribution [3]. To avoid potential confusion, it is useful to state more explicitly what the constructed graph is intended to represent. In the proposed formulation, a node corresponds to a single network flow observed at a specific time step and described by its statistical feature vector. Nodes therefore do not map directly to physical devices, even though the flows themselves originate from device activity. The edges are not derived from observed communication links or network topology. Instead, they are introduced on the basis of similarity in feature space, so that two flows become connected when their behavioral characteristics exhibit sufficient correlation. This connection can arise even when the underlying devices have not exchanged traffic directly. In this sense, the graph captures behavioral proximity rather than physical connectivity. When earlier parts of the manuscript refer to relationships between devices, this should be understood in an abstract way. The intention is to describe implicit behavioral correlations among flows that may stem from devices experiencing similar compromise conditions, rather than to imply the existence of explicit communication paths in the network. Our approach addresses these obstacles by constructing graphs based on statistical feature similarity rather than relying on direct communication links [13].
For each temporal snapshot, we define nodes as individual feature vectors representing flow states at that moment [12]. The adjacency matrix construction uses adaptive thresholding based on feature correlation:

$$A_{ij} = \begin{cases} 1 & \text{if } \mathrm{sim}(x_i, x_j) > \tau \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{sim}(\cdot)$ denotes cosine similarity and $\tau$ represents an adaptive threshold learned during training [14].
This formulation captures something important about how botnet infections actually behave. Infected devices often exhibit correlated behavioral changes even when they are not communicating directly with each other [22]. Devices compromised by the same botnet variant tend to display synchronized scanning patterns, and our graph representation creates implicit edges between such devices that packet-level analysis would miss. Figure 5 illustrates the graph construction process.
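A minimal dense implementation of this adjacency rule is sketched below. A fixed scalar threshold is shown for clarity; making τ learnable, as in the full model, would require a differentiable relaxation of the hard comparison (for example, a sigmoid gate on sim − τ), which is omitted here.

```python
import torch
import torch.nn.functional as F

def build_adjacency(x: torch.Tensor, tau: float) -> torch.Tensor:
    """Connect two flows when the cosine similarity of their feature
    vectors exceeds tau. x: (N, F) node feature matrix."""
    x_unit = F.normalize(x, dim=1)   # row-normalize feature vectors
    sim = x_unit @ x_unit.t()        # pairwise cosine similarity matrix
    adj = (sim > tau).float()
    adj.fill_diagonal_(0.0)          # self-loops are added separately
    return adj
```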
Node features pass through a two-stage embedding process [11]. The raw 115-dimensional feature vectors first go through a projection layer that reduces dimensionality while preserving discriminative information. Then graph convolutional layers aggregate neighborhood information, allowing each node to incorporate context from connected devices. The aggregation follows the message-passing paradigm:
$$h_i^{(l+1)} = \sigma\left(W^{(l)} \cdot \mathrm{AGG}\left(\left\{ h_j^{(l)} : j \in \mathcal{N}(i) \cup \{i\} \right\}\right)\right)$$

where $h_i^{(l)}$ represents node $i$'s embedding at layer $l$, $\mathcal{N}(i)$ denotes its neighborhood, and AGG is the aggregation function [15,16].
We also had to handle missing or corrupted data, which is a practical concern in real deployments [37]. Network packet loss, sensor failures, or transmission errors can result in incomplete feature vectors. Rather than discarding these samples, we implemented a graph-aware imputation strategy where missing features are estimated based on the weighted average of neighboring nodes. This approach leverages spatial correlations to maintain data integrity even when some devices report incomplete information.
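The imputation step can be expressed compactly as a masked neighborhood average, as in the sketch below; the mask convention and the use of adjacency weights are illustrative assumptions rather than a literal transcription of our implementation.

```python
import torch

def impute_missing(x: torch.Tensor, observed: torch.Tensor,
                   adj: torch.Tensor) -> torch.Tensor:
    """Graph-aware imputation sketch. x: (N, F) features with zeros at
    missing entries; observed: (N, F) {0,1} mask of observed entries;
    adj: (N, N) possibly weighted adjacency. Each missing value is replaced
    by the weighted average of that feature over neighbors that observed it."""
    denom = (adj @ observed).clamp(min=1e-6)       # per-entry neighbor weight
    neighbor_avg = (adj @ (x * observed)) / denom
    return torch.where(observed.bool(), x, neighbor_avg)
```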

2.4. Hybrid Model Architecture

The architectural design emerged through extensive experimentation with various configurations. We ultimately converged on a relatively streamlined architecture that balances expressiveness with computational efficiency [17]. The model has two primary components working together. The first is a graph neural network with two GCN layers, each containing 64 hidden units with batch normalization and dropout regularization at 0.4 probability. The second is a single-layer LSTM with 64 hidden units that captures temporal dynamics across the sequence of graph embeddings [24].
The integration between these components was a critical design decision [25]. Rather than treating them as independent modules with an intermediate projection between them, we implemented tight coupling where graph embeddings feed directly into LSTM cells. This preserves the spatial information encoded by the GNN while allowing the LSTM to learn temporal transitions in the embedding space. Table 4 summarizes the architecture specifications.
The forward pass works as follows. At each time step, the GNN processes the current graph to produce node embeddings, which are then aggregated through global mean pooling to create a graph-level representation [12]:
$$z_t = \mathrm{GlobalMeanPool}\left(\mathrm{GNN}(G_t)\right)$$
These graph embeddings form a sequence $z_1, z_2, \ldots, z_T$ that feeds into the LSTM [18]:

$$h_t,\, c_t = \mathrm{LSTM}(z_t,\, h_{t-1},\, c_{t-1})$$

where $h_t$ and $c_t$ represent the hidden and cell states, respectively, maintaining long-term dependencies crucial for multi-stage attack detection [25].
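This forward pass can be summarized in a condensed PyTorch sketch. The layer sizes follow the Table 4 configuration (two 64-unit GCN layers with batch normalization and 0.4 dropout, one 64-unit LSTM); the classification head and the per-window snapshot batching are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class HybridGNNLSTM(nn.Module):
    """Sketch of the coupled GNN-LSTM: graph embeddings feed the LSTM
    directly, so spatial context persists during temporal learning."""
    def __init__(self, in_dim: int, hidden: int = 64, dropout: float = 0.4):
        super().__init__()
        self.gcn1, self.gcn2 = GCNConv(in_dim, hidden), GCNConv(hidden, hidden)
        self.bn1, self.bn2 = nn.BatchNorm1d(hidden), nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # benign vs. attack logits

    def encode_graph(self, x, edge_index, batch):
        # Two rounds of message passing, then mean pooling -> z_t
        h = self.drop(torch.relu(self.bn1(self.gcn1(x, edge_index))))
        h = self.drop(torch.relu(self.bn2(self.gcn2(h, edge_index))))
        return global_mean_pool(h, batch)

    def forward(self, snapshots):
        # snapshots: length-T list of (x, edge_index, batch) graph snapshots
        z = torch.stack([self.encode_graph(*g) for g in snapshots], dim=1)
        out, _ = self.lstm(z)              # (num_windows, T, hidden)
        return self.head(out[:, -1])       # classify from last hidden state
```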
Regularization proved crucial for preventing overfitting [21]. Beyond standard dropout and weight decay, we implemented domain-specific techniques. Graph dropout randomly removes edges during training, which forces the model to learn robust representations that do not depend on specific connectivity patterns. Temporal dropout masks entire time steps, improving the model’s ability to handle missing data during inference. Ablation studies showed that removing any single regularization technique degraded performance by 3 to 5 percentage points [19].
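Both domain-specific dropout variants are simple to implement; the sketch below gives one possible formulation. The drop probabilities `p` are unspecified placeholders, since the exact rates used in our experiments are not restated here.

```python
import torch

def graph_dropout(edge_index: torch.Tensor, p: float) -> torch.Tensor:
    # Randomly remove a fraction p of edges during training so the model
    # cannot rely on any single connectivity pattern.
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= p
    return edge_index[:, keep]

def temporal_dropout(z_seq: torch.Tensor, p: float) -> torch.Tensor:
    # z_seq: (batch, T, dim). Mask whole time steps with probability p,
    # mimicking missing snapshots at inference time.
    mask = (torch.rand(z_seq.size(0), z_seq.size(1), 1,
                       device=z_seq.device) >= p).float()
    return z_seq * mask
```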
We explored deeper architectures during hyperparameter search, including 3-layer GNNs and 2-layer LSTMs [30]. The additional layers provided marginal accuracy improvements at substantial computational cost. A third GNN layer, for instance, added only about 2 percent accuracy while doubling inference time. This observation of diminishing returns led us to the current architecture, which achieves near-optimal performance with reduced computational requirements suitable for resource-constrained deployment [16].

2.5. Risk Scoring and Calibration

Transforming model outputs into calibrated risk scores is often neglected in security systems, but it matters considerably for operational deployment [31]. Raw neural network outputs typically exhibit poor calibration, meaning the predicted probabilities do not reflect true likelihoods. A model might output 0.9 probability for predictions that are actually correct only 70 percent of the time. Our risk scoring function addresses this through a multi-stage calibration pipeline [32]. The initial computation applies temperature scaling to raw logits [31]:
$$R_{\mathrm{raw}} = \mathrm{softmax}\!\left(\frac{\mathrm{logits}}{T}\right)$$
Platt scaling then transforms these initial probabilities into calibrated scores [33]:
$$R_{\mathrm{final}} = \frac{1}{1 + \exp\left(a \cdot R_{\mathrm{raw}} + b\right)}$$

where $a$ and $b$ are parameters fitted using maximum likelihood on a held-out calibration set.
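The two calibration stages can be fitted sequentially on the held-out calibration set, as in the sketch below. The paper specifies maximum-likelihood fitting but not the optimizer, so a generic scipy minimizer is assumed; `logits` and `y` are placeholder arrays for the calibration split.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def fit_calibration(logits: np.ndarray, y: np.ndarray):
    """Fit temperature scaling, then Platt scaling (with the sign
    convention of the equation above), by minimizing the negative
    log-likelihood on a held-out calibration set."""
    def nll(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Stage 1: temperature scaling of the raw logits.
    T = minimize(lambda t: nll(softmax(logits / t[0], axis=1)[:, 1]),
                 x0=[1.0], bounds=[(0.05, 10.0)]).x[0]
    r_raw = softmax(logits / T, axis=1)[:, 1]

    # Stage 2: Platt scaling, R_final = 1 / (1 + exp(a * R_raw + b)).
    a, b = minimize(lambda ab: nll(1.0 / (1.0 + np.exp(ab[0] * r_raw + ab[1]))),
                    x0=[-1.0, 0.5]).x
    return T, a, b
```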
The calibration process revealed interesting patterns in model behavior. Before calibration, the model showed systematic overconfidence, particularly for borderline cases where true class probability was around 0.5. After calibration, the expected calibration error dropped from 0.038 to 0.012, indicating substantially improved probability estimates. Table 5 shows the calibration metrics.
Beyond numerical calibration, our scoring function incorporates contextual factors that influence threat severity [34]. Device criticality weights adjust scores based on potential impact. A compromised security camera protecting critical infrastructure receives higher weighting than an infected smart light bulb. Temporal persistence considers how long suspicious patterns have continued, with sustained anomalies receiving progressively higher scores. Network position accounts for connectivity, recognizing that highly connected nodes pose greater propagation risk [33]. The final aggregation combines these factors:
$$R = R_{\mathrm{final}} \times w_{\mathrm{critical}} \times \left(1 + \lambda \cdot t_{\mathrm{persist}}\right) \times \left(1 + \mu \cdot \deg(\mathrm{node})\right)$$

where $w_{\mathrm{critical}}$ represents device criticality derived from role analysis [38], $t_{\mathrm{persist}}$ measures persistence duration, and $\deg(\mathrm{node})$ indicates network connectivity.
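For concreteness, the aggregation can be written as a single function; the λ and μ weights shown are illustrative placeholders rather than the fitted values used in our experiments.

```python
def contextual_risk(r_final: float, w_critical: float, t_persist: float,
                    node_degree: float,
                    lam: float = 0.1, mu: float = 0.05) -> float:
    """Scale the calibrated probability by device criticality, persistence
    duration, and network connectivity, per the aggregation formula above."""
    return r_final * w_critical * (1 + lam * t_persist) * (1 + mu * node_degree)
```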
This risk-based approach enables graduated responses proportional to threat magnitude [34]. Low scores might warrant only logging and monitoring, medium scores could trigger automated containment procedures [39], and high scores demand immediate human intervention. This optimization of response allocation matters in IoT environments where security teams cannot realistically investigate every anomaly [40].
All experiments were conducted on a laptop workstation equipped with an Intel Core i7-8565U processor running at 1.80 GHz, 16 GB of RAM, and Windows 11 as the operating system. Model training was performed using PyTorch v1.13.1 without GPU acceleration. As a result, training times were relatively long, and each major experiment required several days to complete.

3. Results

This section presents experimental findings from evaluating the proposed hybrid GNN-LSTM architecture on the N-BaIoT and CICIoT2023 benchmark datasets. The evaluation encompasses classification performance metrics, probability calibration assessment, and comparative analysis with existing approaches documented in the literature. All experiments were implemented using PyTorch [41] with PyTorch Geometric extensions. The Adam optimizer [42] was employed with the hyperparameter configuration specified in Table 4. Following the recommendations of Raschka [43] for rigorous model evaluation, we report multiple complementary metrics rather than relying on any single performance measure.

3.1. Classification Performance on N-BaIoT

The hybrid architecture achieved strong classification performance on the N-BaIoT test set, which contained 149,978 samples distributed equally between benign and attack traffic classes. Table 6 presents the complete classification metrics. The model attained 99.88% accuracy, which compares favorably with the original deep autoencoder approach proposed by Meidan et al. [26] that first introduced this dataset. Their autoencoder-based anomaly detection achieved detection rates varying between 88% and 100% depending on the device type, with the variability arising from the necessity of training separate models for each device category. The unified architecture presented here eliminates this device-specific training requirement while maintaining consistently high performance across all device types represented in the dataset. This improvement aligns with observations by Wu et al. [44] regarding the capacity of graph neural networks to learn transferable representations across heterogeneous node types.
The precision of 99.82% and recall of 99.94% indicate that the model achieves an effective balance between false alarm minimization and attack detection completeness. This balance is particularly relevant for operational intrusion detection systems, where excessive false positives can lead to alert fatigue among security analysts [9,10]. Axelsson [45] demonstrated through formal analysis that the base-rate fallacy poses fundamental challenges for intrusion detection, whereby even highly accurate classifiers can produce unmanageable false alarm rates when the prior probability of attack is low. The balanced sampling strategy employed in our experiments [46] addresses this concern by ensuring equal representation of attack and benign traffic during training, though the extremely low false positive rate of 0.18% suggests the model would remain practical even under realistic base-rate conditions. The F1-score of 0.9988 confirms the equilibrium between precision and recall. Notably, the recall exceeds precision by a small margin, suggesting the model exhibits a slight bias toward attack detection rather than conservative classification. For security applications, this asymmetry is generally preferable to the opposite configuration, as missed attacks typically carry more severe consequences than false alarms [8,45].
The confusion matrix presented in Figure 6 provides granular insight into the classification distribution. Of 74,989 benign samples, the model correctly classified 74,851 instances while misclassifying 138 as attacks. For the attack class, 74,941 of 74,989 samples were correctly identified, with only 48 attacks evading detection as false negatives. This distribution yields a false positive rate of 0.18% and a false negative rate of 0.06%. The relatively lower false negative rate aligns with the design philosophy of prioritizing attack detection, which is consistent with recommendations from the intrusion detection literature [9,47]. Sommer and Paxson [47] critically examined the application of machine learning to network intrusion detection and identified several practical challenges, including the difficulty of achieving low false negative rates without simultaneously inflating false positives. The results presented here suggest that the hybrid GNN-LSTM architecture navigates this tradeoff effectively, achieving both metrics at levels substantially below typical operational thresholds.
The receiver operating characteristic curve depicted in Figure 7 demonstrates the model’s discrimination capability across the complete range of classification thresholds. The area under the ROC curve reached 0.9995, indicating near-optimal separation between the benign and attack class probability distributions. Hanley and McNeil [48] established that AUROC provides a threshold-independent measure of classifier performance, representing the probability that a randomly selected positive instance receives a higher score than a randomly selected negative instance. The near-unity value obtained here indicates that the model assigns higher attack probabilities to actual attacks than to benign traffic in 99.95% of random pairwise comparisons. This performance exceeds results reported by several recent studies employing the same dataset. Alkahtani and Aldhyani [19] achieved 90.88% accuracy using a CNN-LSTM hybrid for Mirai and BASHLITE detection, while their approach required separate processing pathways for spatial and temporal features. The architecture proposed here integrates these dimensions more tightly through the GNN-LSTM coupling, which may explain the performance improvement. Zhang et al. [49] similarly demonstrated that graph-based representations enhance intrusion detection by capturing relational patterns invisible to instance-based classifiers.
The precision–recall curve in Figure 8 provides complementary evidence of classification quality, with the area under the PR curve reaching 0.9993. Davis and Goadrich [50] demonstrated that precision–recall curves offer advantages over ROC curves when class distributions are imbalanced, as they remain sensitive to changes in the minority class that ROC curves may obscure. Although our experimental design employed balanced sampling, the near-unity AUCPR indicates that high precision is maintained across recall levels, suggesting the model does not sacrifice precision to achieve high recall or vice versa.
This observation is particularly important because, as Johnson and Khoshgoftaar [46] noted in their survey on class imbalance, many deep learning models struggle to maintain this balance when confronted with the severe class skew typical of network traffic data. The consistent performance across both ROC and PR perspectives suggests the model has learned robust decision boundaries rather than exploiting class distribution artifacts.
Beyond standard classification metrics, the Matthews correlation coefficient of 0.9975 provides a more comprehensive assessment of binary classification quality. Flach [51] emphasized that accuracy can be misleading when class distributions are skewed, as a classifier that simply predicts the majority class achieves high accuracy without learning anything meaningful. MCC addresses this limitation by accounting for all four quadrants of the confusion matrix and producing a balanced measure that ranges from −1 to +1, where values near +1 indicate perfect prediction, 0 indicates random prediction, and −1 indicates complete disagreement. The near-unity MCC value indicates the model achieves genuine discrimination rather than exploiting class distribution artifacts. Cohen’s kappa coefficient, also at 0.9975, confirms this interpretation by quantifying agreement beyond chance expectation. Khraisat et al. [10] recommended reporting both metrics for intrusion detection evaluations, as they provide complementary perspectives on classifier reliability. The convergence of both measures at essentially the same high value reinforces confidence in the model’s discriminative capability. Shorman et al. [29] reported MCC values around 0.95 using one-class SVM with grey wolf optimization on the same dataset, suggesting the hybrid architecture provides meaningful improvement over single-paradigm approaches.

3.2. Cross-Validation Results on CICIoT2023

Validation on a second dataset with substantially different characteristics provides evidence regarding the generalizability of the proposed architecture. Sommer and Paxson [47] identified generalization failure as a primary obstacle to deploying machine learning in operational security systems, observing that models frequently achieve excellent performance on training data while failing to transfer to production environments. CICIoT2023 differs from N-BaIoT across multiple dimensions that affect classification difficulty [35]. The device population expands from 9 to 105 devices spanning consumer electronics, medical monitors, industrial sensors, and smart building infrastructure. The attack taxonomy grows from 10 to 33 distinct attack types organized into seven categories. Perhaps most significantly, the feature space contracts from 115 to 39 dimensions, requiring the model to extract discriminative information from a more compressed representation. Neto et al. [35] designed CICIoT2023 specifically to address limitations in earlier IoT security datasets, including the need for greater device diversity and more contemporary attack implementations. The five-year temporal gap between dataset collections (2018 versus 2023) further tests whether learned representations capture enduring attack characteristics or merely memorize transient signatures [51].
We employed stratified five-fold cross-validation to obtain robust performance estimates with quantified uncertainty. This evaluation strategy addresses concerns raised by Raschka [43] regarding the sensitivity of machine learning results to particular train-test splits, especially when datasets contain heterogeneous subpopulations. Table 7 presents the aggregated metrics across all folds, reported as mean values with standard deviations and 95% confidence intervals. Mean accuracy reached 99.73% with a remarkably low standard deviation of 0.02%, indicating consistent performance regardless of which samples comprise the training versus test partitions. This stability is noteworthy given the dataset’s heterogeneity. Different folds necessarily contain different mixtures of device types and attack categories, yet the model maintains nearly identical performance across all partitions. The narrow confidence intervals throughout the metrics suggest that reported values would likely replicate in future evaluations using similar methodology.
A notable pattern emerges in the precision–recall relationship on CICIoT2023. Precision approaches unity at 99.99%, substantially exceeding the 99.82% achieved on N-BaIoT. However, recall decreases to 99.46%, roughly 0.5 percentage points below the N-BaIoT result. This shift suggests the model learned more conservative decision boundaries on the larger dataset, requiring stronger evidence before classifying traffic as malicious. The practical implication is that false positives become extremely rare while false negatives increase modestly. For organizations prioritizing analyst workload management over absolute detection completeness, this tradeoff may be acceptable or even desirable. Chandola et al. [8] observed that anomaly detection systems inherently face this fundamental tension, and the optimal operating point depends on organizational context and threat model assumptions. Gates and Taylor [52] provocatively challenged the anomaly detection paradigm itself, arguing that the concept of normality is poorly defined in real networks. The high performance achieved here on a dataset with 105 different device types suggests the GNN component effectively models heterogeneous baseline behaviors, addressing this concern to some extent.
The aggregated confusion matrix in Figure 9 illustrates the classification distribution averaged across folds. Per fold, the model correctly classified approximately 39,985 benign samples while misclassifying only 3 as attacks on average. For attack traffic, correct classifications averaged 39,772 per fold, with 216 false negatives. The extremely low false positive count of 3 per fold is striking and warrants interpretation. It indicates the model has learned highly specific attack signatures that rarely trigger on benign traffic, even across the diverse device population represented in CICIoT2023. This specificity likely reflects the GNN component’s ability to identify topological anomalies that distinguish attack traffic from device-specific benign variations [13,14]. Xu et al. [53] demonstrated that graph neural networks possess theoretical expressiveness advantages for distinguishing graph structures, which may explain why the architecture achieves such high specificity across heterogeneous devices.
Table 8 provides the detailed per-fold breakdown, enabling assessment of result consistency. The accuracy range spans from 99.70% (Fold 4) to 99.75% (Fold 2), a total variation of just 0.05 percentage points. This narrow spread indicates that the model’s performance does not depend sensitively on the particular samples comprising each fold. Notably, Fold 4 achieved perfect precision (100.00%) but the lowest recall (99.39%), representing the most conservative classification behavior among the five partitions. Fold 2 showed the highest recall (99.51%) with slightly lower but still exceptional precision (99.99%). These fold-level variations likely reflect different mixtures of attack categories, with reconnaissance attacks [22,23] and certain web-based attack variants proving more difficult to distinguish from benign traffic than volumetric DDoS attacks. Esmaeili et al. [22] observed similar detection difficulty gradients when applying GNN-based methods to Mirai and Gafgyt variants, with stealthier attack phases presenting greater classification challenges.
The box plot visualization in Figure 10 illustrates metric distributions across folds. The narrow interquartile ranges confirm quantitatively what the standard deviations suggest qualitatively: performance varies minimally across data partitions. No outlier folds appear in any metric, which would raise concerns about data leakage or problematic train-test contamination [43]. The ROC curves for all five folds, presented in Figure 11, cluster tightly with AUROC values ranging from 0.9981 to 0.9985. The precision–recall curves in Figure 12 demonstrate similar consistency, with AUCPR values between 0.9988 and 0.9990. This convergence across folds provides confidence that reported performance reflects genuine model capability rather than favorable random sampling. The stability also suggests the architecture would maintain similar performance on future data drawn from the same distribution, an important consideration for operational deployment [47,54].

3.3. Probability Calibration Analysis

Classification accuracy provides an incomplete picture for operational security systems. When a model outputs probability estimates, those estimates should reflect true likelihoods to enable risk-based decision making [31]. This property, formally termed calibration, determines whether probability outputs can be trusted for prioritizing security responses. Guo et al. [31] demonstrated that modern neural networks often exhibit systematic miscalibration, typically manifesting as overconfidence in predictions. Niculescu-Mizil and Caruana [55] provided empirical evidence that even well-performing classifiers frequently produce poorly calibrated probabilities, necessitating post hoc calibration procedures. The calibration analysis presented here addresses this concern by evaluating whether the proposed architecture produces reliable probability estimates suitable for the graduated response framework described in Section 2.5.
We assessed calibration using the Brier score, which Brier [32] introduced as a proper scoring rule for probabilistic predictions. The Brier score equals the mean squared error between predicted probabilities and binary outcomes, ranging from 0 (perfect prediction) to 1 (completely incorrect prediction). Bishop [56] noted that proper scoring rules like the Brier score incentivize honest probability reporting, as the expected score is minimized when predicted probabilities match true conditional probabilities. On N-BaIoT, the model achieved a Brier score of 0.0015, while CICIoT2023 yielded 0.0030 with standard deviation 0.0002 across folds. Both values fall substantially below the 0.01 threshold that is conventionally interpreted as indicating excellent calibration [31,55]. The doubling of the Brier score between datasets reflects the greater classification difficulty posed by CICIoT2023's expanded attack taxonomy, though both values remain well within acceptable ranges for risk-sensitive applications. Table 9 compares the probabilistic metrics across both datasets.
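For reference, the Brier computation itself is a one-line call in scikit-learn; the toy arrays below merely illustrate the metric and are unrelated to the reported results.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# The Brier score is the mean squared gap between the predicted attack
# probability and the binary outcome.
y_true = np.array([0, 0, 1, 1])
p_attack = np.array([0.02, 0.10, 0.85, 0.97])
print(brier_score_loss(y_true, p_attack))   # -> 0.00845
```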
The consistently low Brier scores indicate that when the model predicts 90% probability of attack, approximately 90% of such samples are indeed malicious. This reliability contrasts with many deep learning classifiers that produce confident predictions regardless of actual certainty [31,55]. Pearl [57] emphasized the importance of well-calibrated probabilities for rational decision-making under uncertainty, a principle that extends directly to security operations contexts. The risk scoring function described by Equations (7) and (8) depends fundamentally on calibrated probability inputs, and the empirical results confirm this prerequisite is satisfied. Security operations teams can therefore trust the probability outputs for prioritizing investigation and response activities, allocating resources proportionally to predicted threat severity rather than treating all detections equivalently [33,58]. Klein et al. [58] studied rapid decision-making in high-stakes environments and found that well-calibrated probability information significantly improved expert decision quality.
The calibration curves in Figure 13 and Figure 14 visualize the relationship between predicted probabilities and observed frequencies. A perfectly calibrated model produces points lying exactly on the diagonal where predicted probability equals empirical frequency [55]. The N-BaIoT calibration curve (Figure 13) tracks this diagonal closely across the probability range, with minor deviations only at extreme probabilities where sample counts become sparse. The CICIoT2023 calibration curve (Figure 14) shows slightly more variation but remains well-aligned with the ideal diagonal. The temperature scaling and Platt scaling procedures described in Section 2.5 [31] contribute to this alignment by correcting the systematic overconfidence typical of neural network classifiers. Expected calibration error, which measures the average gap between predicted probability and observed frequency, remained below 0.02 for both datasets, indicating reliable probability estimation across the prediction range.
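Expected calibration error can be estimated by binning predictions and comparing each bin's mean predicted probability with its observed attack frequency. The sketch below uses ten equal-width bins, a common default; the exact binning scheme used in our evaluation is not restated here, so treat the parameters as assumptions. The per-bin pairs of mean predicted probability and observed frequency are also the points plotted in calibration curves such as Figures 13 and 14.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Weighted average gap between mean predicted probability and
    observed positive frequency over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    edges[0] -= 1e-9  # so probabilities of exactly 0 fall in the first bin
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob > lo) & (y_prob <= hi)
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by bin occupancy
    return float(ece)
```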
The probability distribution histograms in Figure 15 and Figure 16 reveal the model’s output characteristics. For N-BaIoT (Figure 15), benign samples concentrate near probability 0 while attack samples concentrate near probability 1, with minimal overlap in the intermediate range. This bimodal distribution explains the high classification accuracy: the model exhibits strong conviction for most samples, placing them firmly in one class or the other. The CICIoT2023 distributions (Figure 16) show a similar pattern with slightly broader tails, reflecting increased uncertainty for some attack categories. The reconnaissance attacks in particular, which involve subtle probing behaviors [22,23,59], may account for samples receiving intermediate probabilities. Hutchins et al. [59] described how early-stage attack reconnaissance often mimics legitimate network discovery activity, creating inherent ambiguity that any detection system must navigate.
The threshold sweep analyses in Figure 17 and Figure 18 characterize how classification metrics vary with the decision threshold. At the default threshold of 0.5, the F1-score is maximized for both datasets, confirming appropriate threshold selection. However, organizations may legitimately choose alternative operating points based on their specific requirements [9,51]. Lowering the threshold increases recall at the cost of precision, appropriate for high-security environments where missing attacks carries severe consequences. Raising the threshold has the opposite effect, suitable for organizations experiencing alert fatigue who prefer fewer but higher-confidence detections. The smooth curves indicate stable behavior across thresholds without abrupt transitions that might complicate threshold selection in practice. This stability reflects well-separated class distributions and supports flexible deployment across diverse operational contexts [59,60].
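Such a sweep is straightforward to reproduce. The sketch below, assuming calibrated probabilities and ground-truth labels held as arrays, evaluates precision, recall, and F1 at a grid of thresholds so that an operating point can be selected against organizational constraints.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def threshold_sweep(y_true, y_prob, thresholds=None):
    """Classification metrics at each candidate decision threshold."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    rows = []
    for t in thresholds:
        y_pred = (np.asarray(y_prob) >= t).astype(int)
        rows.append({"threshold": float(t),
                     "precision": precision_score(y_true, y_pred, zero_division=0),
                     "recall": recall_score(y_true, y_pred, zero_division=0),
                     "f1": f1_score(y_true, y_pred, zero_division=0)})
    return rows

# Example policies: maximize F1, or take the lowest threshold whose
# precision still clears an operational floor (e.g., 0.99):
# best_f1 = max(rows, key=lambda r: r["f1"])
```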

3.4. Cross-Dataset Comparison and Generalization Assessment

Comparing performance across the two benchmark datasets enables assessment of whether the architecture captures generalizable patterns or merely exploits dataset-specific artifacts. This concern is particularly salient for machine learning approaches to security, as Sommer and Paxson [47] documented numerous cases where seemingly effective classifiers failed catastrophically when deployed outside their training distribution. Quiñonero-Candela et al. [51] formalized the problem of dataset shift and demonstrated its prevalence across application domains. Table 10 presents a comprehensive cross-dataset comparison. The model achieves slightly higher accuracy on N-BaIoT (99.88%) compared to CICIoT2023 (99.73%), a difference of 0.15 percentage points. This modest degradation is noteworthy given the substantially greater complexity of CICIoT2023: an elevenfold increase in device count, threefold increase in attack types, and threefold reduction in feature dimensionality. That performance remains this stable across such different conditions suggests the hybrid architecture extracts fundamental attack characteristics that persist across datasets rather than learning superficial correlations specific to particular data collection environments.
The recall difference of 0.48 percentage points between datasets warrants careful interpretation. On N-BaIoT, the false negative rate was 0.06%, meaning the model missed approximately 1 in 1600 attacks. On CICIoT2023, this rate increased to 0.54%, or roughly 1 in 185 attacks. While still a small absolute rate, the relative increase is substantial. Several factors likely contribute. CICIoT2023 includes reconnaissance attacks involving subtle host discovery and port scanning behaviors [35,59] that may generate traffic patterns overlapping with legitimate device activity. Staniford et al. [61] examined stealthy self-propagating attacks, and Plohmann et al. [62] analyzed domain-generating malware; both observed that covert communication patterns are specifically designed to evade detection by blending with benign traffic. The broader attack taxonomy also means some attack types appear with relatively low frequency, potentially limiting the model’s exposure during training. Esmaeili et al. [22] noted similar challenges with reconnaissance detection in their GNN-based approach, suggesting this is a general difficulty rather than a limitation specific to the proposed architecture.
The feature dimensionality reduction from 115 to 39 features represents a significant challenge for any machine learning approach. N-BaIoT’s rich feature space, computed using the Argus tool across multiple temporal aggregation windows [26], provides detailed behavioral characterization of each traffic flow. CICIoT2023’s more compact representation retains essential flow-level statistics but discards some information that may aid discrimination [35]. That classification performance degrades by less than 0.2 percentage points despite losing two-thirds of the feature dimensions suggests the GNN-LSTM architecture extracts discriminative patterns efficiently. LeCun, Bengio, and Hinton [60] observed that deep learning architectures excel at learning hierarchical representations that distill essential structure from high-dimensional inputs. The GNN component may contribute particularly here, as graph-based representations can capture relational information that compensates for reduced per-sample feature detail [13,14,44]. Wu et al. [44] surveyed graph neural network architectures and identified their capacity for learning representations that aggregate neighborhood structure, potentially enabling the model to reconstruct useful information from relationships among samples even when individual sample features are limited.
Comparison with results reported in the original dataset publications and subsequent literature provides external validation. Meidan et al. [26] reported device-specific detection rates between 88% and 100% using deep autoencoders on N-BaIoT, with performance varying substantially across device types. The unified 99.88% accuracy achieved here demonstrates that cross-device generalization is possible with appropriate architectural choices. For CICIoT2023, Neto et al. [35] reported baseline machine learning results with accuracy below 80% and F1-scores under 50% for the full 33-class multi-attack classification problem. While our binary classification formulation differs from their multi-class setting, the 99.73% accuracy substantially exceeds their baseline deep neural network performance of approximately 95% on the binary detection task. Chen and Guestrin’s XGBoost [63] has been applied to N-BaIoT with reported accuracy around 99.97% [27], though gradient boosting approaches lack the temporal modeling capacity essential for detecting attacks that manifest through behavioral evolution rather than instantaneous anomalies.

3.5. Training Dynamics and Convergence Analysis

Examining the training process provides diagnostic information about model learning and potential overfitting. LeCun, Bengio, and Hinton [60] emphasized that understanding training dynamics is essential for deep learning practitioners, as pathological behaviors often manifest through characteristic patterns in loss curves. Figure 19 displays the training history for N-BaIoT, showing loss and accuracy trajectories for both training and validation sets. The model converged rapidly, with validation accuracy exceeding 99.8% within the first few epochs. Training continued for 17 epochs before early stopping activated based on validation loss stagnation. The close tracking between training and validation curves indicates the regularization strategy successfully prevented overfitting despite the model’s capacity to memorize training examples. The absence of divergence between training and validation metrics suggests the learned representations generalize beyond the specific samples encountered during training [17,18,60].
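Early stopping as described here follows the standard patience-based pattern. The sketch below is a generic implementation of that pattern; the patience and tolerance values are assumptions rather than the exact configuration used.

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for
    `patience` consecutive epochs."""
    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0  # improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```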
The learning rate schedule employed adaptive reduction following the ReduceLROnPlateau strategy. Initial training used a learning rate of 0.0001, which the scheduler reduced to 0.00005 around epoch 8 when validation loss improvement slowed. A further reduction to 0.000025 occurred near training completion. Kingma and Ba [42] recommended such adaptive schedules for Adam optimization, noting that reducing learning rates enables fine-tuning once the optimizer approaches a local minimum. The weight decay of 0.01, implementing L2 regularization, penalized large weights and contributed to generalization. Masters and Luschi [64] demonstrated that appropriate batch sizes interact with learning rate schedules to determine optimization dynamics; the batch size of 128 employed here represents a compromise between computational efficiency and gradient quality. Kim et al. [18] recommended similar regularization strategies for LSTM-based intrusion detection, noting that temporal models are particularly susceptible to overfitting on sequential patterns that do not generalize to novel attack variants.
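In PyTorch, the framework used in this work [41], the configuration described above can be expressed as follows. The model object, training helpers, and scheduler patience are placeholders, and the reduction factor of 0.5 is inferred from the reported halving of the learning rate.

```python
import torch

# `model` is the GNN-LSTM network; DataLoaders use batch size 128.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)  # patience assumed

for epoch in range(max_epochs):
    train_one_epoch(model, optimizer, train_loader)  # assumed helper
    val_loss = evaluate(model, val_loader)           # assumed helper
    scheduler.step(val_loss)  # halves LR when validation loss plateaus
```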
For CICIoT2023, training dynamics varied across the five cross-validation folds but exhibited consistent patterns. Early stopping triggered between epochs 21 and 30 depending on the fold, with validation F1-scores stabilizing between 0.9971 and 0.9976 prior to convergence. The longer training times compared to N-BaIoT reflect both the larger validation sets in cross-validation and the more complex decision boundaries required for the expanded attack taxonomy. Notably, no fold exhibited validation performance substantially worse than training performance at any point, indicating that the regularization configuration remains appropriate for the larger and more diverse dataset. Gal and Ghahramani [65] demonstrated that dropout can be interpreted as approximate Bayesian inference, providing theoretical justification for its effectiveness in preventing overconfident predictions. The dropout rate of 0.4 applied to GNN layers, combined with the weight decay, appears sufficient to control overfitting even with 33 distinct attack types to discriminate [19,21].
Ablation experiments quantified the contribution of individual regularization components. Removing dropout from the GNN layers reduced test accuracy by approximately 1.2 percentage points on N-BaIoT, from 99.88% to 98.7%. Removing batch normalization had an even larger impact, degrading accuracy by 2.1 percentage points to 97.8%. These decrements confirm that both techniques contribute meaningfully to final performance rather than adding complexity without benefit. The batch normalization effect may partially reflect training stability benefits in addition to regularization, as suggested by previous analyses of deep neural network training dynamics [11,12,60]. The combination of dropout, batch normalization, and weight decay appears to provide complementary regularization effects that jointly enable strong generalization across the heterogeneous traffic patterns present in both benchmark datasets. Bottou and Bousquet [66] analyzed the tradeoffs inherent in large-scale learning and noted that regularization becomes increasingly important as model capacity grows relative to training set size.

3.6. Summary of Experimental Findings

The experimental evaluation demonstrates that the hybrid GNN-LSTM architecture achieves excellent performance for IoT botnet detection across two substantially different benchmark datasets. On N-BaIoT, the model attained 99.88% accuracy with F1-score of 0.9988, AUROC of 0.9995, and Brier score of 0.0015. These results compare favorably with previously published approaches on this dataset, including the original deep autoencoder method of Meidan et al. [26] and subsequent machine learning investigations [27,28,29]. The performance exceeds typical results reported for CNN-LSTM hybrids [19], ensemble methods [63], and single-paradigm GNN approaches [49], suggesting the tight integration of graph-based and sequential representations provides synergistic benefits. Five-fold cross-validation on CICIoT2023 yielded 99.73% accuracy with standard deviation of 0.02%, demonstrating consistent performance across data partitions despite the dataset’s substantially greater scale and complexity. The minimal performance degradation when moving from 115 to 39 features, and from 9 to 105 devices, suggests the architecture captures fundamental attack characteristics that generalize across data collection environments.
The calibration analysis confirms that probability outputs satisfy the reliability requirements for risk-based decision making [31,55,56]. Brier scores of 0.0015 and 0.0030 on N-BaIoT and CICIoT2023, respectively, indicate well-calibrated probabilities suitable for the graduated response framework described in Section 2.5. This calibration property distinguishes the approach from many deep learning classifiers that exhibit systematic overconfidence [31]. Security operations teams can therefore use the probability outputs directly for alert prioritization, allocating investigation resources proportionally to predicted threat severity [33,58,67]. Gordon and Loeb [67] analyzed the economics of information security investment and demonstrated that rational resource allocation requires accurate threat probability estimates, precisely the capability that calibrated outputs provide. The threshold sweep analyses further demonstrate stable behavior across operating points, enabling organizations to tune precision–recall tradeoffs according to their specific security requirements and operational constraints [9,51,68].
The consistency between N-BaIoT and CICIoT2023 results provides evidence of architectural robustness that addresses concerns raised by Sommer and Paxson [47] regarding the generalization challenges facing machine learning approaches to intrusion detection. These datasets differ not only in scale but in collection methodology, feature extraction approach, temporal context, and attack composition. N-BaIoT represents controlled laboratory infections from 2018 using Mirai and BASHLITE variants, while CICIoT2023 captures contemporary attack tools across a broader taxonomy spanning DDoS, reconnaissance, web-based attacks, and more. The five-year temporal gap between datasets tests whether learned representations remain valid as attack methodologies evolve [4,5,61]. That the hybrid architecture maintains performance across this temporal and methodological gap suggests it has learned attack characteristics that persist across botnet generations, rather than memorizing signatures specific to particular malware implementations [22,25,69]. Mohaisen and Alrawi [69] emphasized that high-fidelity behavioral analysis is essential for detecting malware variants that deliberately evade signature-based detection. This generalization capability is essential for practical deployment, where the threat landscape evolves continuously and detection systems must adapt to novel attack variants without complete retraining [70,71].

4. Discussion

4.1. Interpretation of Primary Findings

The experimental results raise several questions worth exploring, and perhaps the most pressing concerns why the hybrid architecture performs as well as it does. One possible explanation lies in the complementary nature of what each component captures. Graph neural networks, as Kipf and Welling [11] originally demonstrated, excel at aggregating information from neighboring nodes to construct representations that encode structural relationships. When applied to network traffic, this means the GNN layers can identify patterns that emerge not from individual flows in isolation but from how flows relate to one another across the network topology. The LSTM component, meanwhile, brings a different strength to the table. Hochreiter and Schmidhuber [17] designed these recurrent units specifically to capture dependencies across time, and botnet attacks often exhibit temporal signatures that distinguish them from benign traffic. The combination seems to work because attacks manifest through both dimensions simultaneously. A compromised device does not merely produce anomalous individual packets; it participates in coordinated activity with other infected nodes, and this coordination unfolds over time in ways that neither spatial nor temporal analysis alone would fully reveal.
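To make this division of labor concrete, the following deliberately minimal sketch composes one mean-aggregation graph layer with an LSTM over the resulting embeddings. Layer widths, the dense adjacency representation, and the sequencing scheme are illustrative simplifications, not the published architecture.

```python
import torch
import torch.nn as nn

class TinyGNNLSTM(nn.Module):
    """Graph aggregation feeding recurrent processing (illustrative)."""
    def __init__(self, in_dim: int, hid: int = 64):
        super().__init__()
        self.gnn = nn.Linear(in_dim, hid)      # shared node transform
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.head = nn.Linear(hid, 1)          # attack logit per flow

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) flow features; adj: (N, N) row-normalized
        # similarity graph. Each flow is first mixed with its neighbors...
        h = torch.relu(self.gnn(adj @ x))       # (N, hid)
        # ...then the time-ordered flows pass through the LSTM so that
        # the relational embeddings are read in temporal context.
        out, _ = self.lstm(h.unsqueeze(0))      # (1, N, hid)
        return self.head(out.squeeze(0))        # (N, 1) logits
```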
The performance differential between N-BaIoT and CICIoT2023 deserves closer examination. At first glance, one might expect the model to perform substantially worse on CICIoT2023 given its increased complexity. The attack taxonomy expands from 10 types to 33, the device population grows from 9 to 105, and the feature space shrinks by two thirds. Yet accuracy drops by only 0.15 percentage points. This resilience suggests, though does not definitively prove, that the architecture learns something more fundamental than dataset-specific patterns. Meidan et al. [26] faced exactly the opposite situation when they introduced N-BaIoT; their autoencoder approach required training separate models for each device type because the learned representations did not transfer well across devices. The unified model presented here sidesteps this limitation entirely. Whether this improvement stems from the graph structure, the temporal modeling, or some interaction between them is difficult to disentangle. Wu et al. [44] argued that GNNs possess an inherent capacity for learning representations that generalize across heterogeneous graph structures, and the cross-dataset stability observed here is at least consistent with that hypothesis.
The shift in precision–recall balance between datasets tells an interesting story about how the model adapts to different data characteristics. On N-BaIoT, recall slightly exceeds precision, meaning the model errs toward flagging suspicious traffic even at the cost of occasional false alarms. On CICIoT2023, this relationship inverts. Precision climbs to 99.99% while recall dips to 99.46%, indicating more conservative classification behavior. One interpretation is that the greater attack diversity in CICIoT2023 makes the model more cautious. When the training data includes 33 different attack types rather than 10, the decision boundary must accommodate more variation, and the model apparently responds by requiring stronger evidence before classifying traffic as malicious. From an operational standpoint, both configurations have merit. Security teams drowning in false positives might prefer the CICIoT2023 behavior, while organizations prioritizing detection completeness might favor the N-BaIoT pattern. The threshold sweep analyses in Section 3.3 demonstrate that operators can adjust this tradeoff by selecting alternative classification thresholds, which provides practical flexibility that fixed-threshold approaches lack.

4.2. Comparison with Prior Approaches

Situating these results within the broader literature requires some care, because direct comparisons across studies are complicated by differences in experimental methodology. That said, certain patterns emerge. The original N-BaIoT work by Meidan et al. [26] reported detection rates between 88% and 100% using deep autoencoders, with the variation depending on device type. The 99.88% unified accuracy achieved here falls at the high end of that range while eliminating the need for device-specific models. Alkahtani and Aldhyani [19] applied a CNN-LSTM hybrid to similar IoT botnet data and reported 90.88% accuracy, substantially below the results obtained with the GNN-LSTM approach. Their architecture processed spatial and temporal features through separate pathways before combining them, whereas the approach presented here integrates graph structure directly into the recurrent processing. This tighter coupling may account for some of the performance difference, though other factors including hyperparameter choices and data preprocessing undoubtedly contribute as well.
Graph-based approaches have gained traction in recent intrusion detection research, and the results here add to that growing body of evidence. Esmaeili et al. [22] developed a GNN framework for detecting Mirai, Gafgyt, and Tsunami variants in critical infrastructure contexts, achieving strong results but noting particular difficulty with reconnaissance-phase detection. The same pattern appears in our CICIoT2023 experiments, where the slightly elevated false negative rate likely reflects the challenge of identifying subtle probing behaviors that resemble legitimate network discovery. Zhang et al. [49] combined GNNs with ensemble learning for intrusion detection and similarly found that graph representations improved performance over feature-vector approaches. Duan et al. [23] took a different tack by applying dynamic line graph neural networks with semi-supervised learning, demonstrating that graph structure can reduce the amount of labeled data required for effective training. The convergence of these various research threads suggests that graph-based representations offer genuine advantages for network security applications, not merely incremental improvements but qualitatively different capabilities for capturing relational patterns in traffic data.
The comparison with gradient boosting methods merits specific attention because XGBoost and similar algorithms have achieved impressive results on tabular security data [71]. Al-Akhras et al. [27] reported accuracy around 99.97% using machine learning approaches on N-BaIoT, which actually exceeds the 99.88% obtained here. Does this mean the deep learning approach is unnecessary? Not necessarily. Gradient boosting operates on fixed feature vectors and cannot model temporal evolution or inter-sample relationships without substantial feature engineering. When attacks manifest through behavioral patterns that unfold over time, or through coordination among multiple endpoints, these methods face fundamental limitations that architectural innovations like GNN-LSTM address directly. The computational cost is admittedly higher, as the 50 h training time for CICIoT2023 cross-validation attests, but for applications where detection quality justifies the investment, the hybrid approach offers capabilities that simpler methods cannot match.

4.3. Calibration and Operational Implications

The calibration analysis deserves emphasis because it addresses a concern that much of the intrusion detection literature overlooks. Achieving high accuracy is valuable, but if the probability estimates accompanying those predictions are unreliable, operators cannot make informed decisions about response prioritization. Guo et al. [31] documented that modern neural networks frequently produce overconfident predictions, outputting probabilities near 0 or 1 even when the evidence is ambiguous. The Brier scores of 0.0015 and 0.0030 obtained on N-BaIoT and CICIoT2023, respectively, indicate that this particular architecture avoids that pathology. When the model outputs 80% attack probability, approximately 80% of such samples are indeed attacks. This correspondence between predicted probability and observed frequency enables the graduated response framework outlined in Section 2.5, where different probability thresholds trigger different response protocols.
The practical value of calibrated probabilities extends beyond simple threshold-based classification. Security operations centers routinely face resource constraints that force difficult prioritization decisions [67,68]. Not every alert can receive immediate investigation, so analysts must triage based on perceived severity. If probability outputs reliably indicate threat likelihood, they provide a rational basis for this triage. Gordon and Loeb [67] formalized the economics of security investment and demonstrated that optimal resource allocation depends critically on accurate probability estimates. Miscalibrated models that cry wolf with high probability estimates for benign traffic, or that understate the certainty of genuine attacks, undermine this allocation and may lead to either wasted resources or missed incidents. The calibration curves presented in Figure 13 and Figure 14 show that the proposed architecture avoids both failure modes across the probability range.
Pearl [57] argued decades ago that probabilistic reasoning requires well-calibrated inputs to produce sensible conclusions, and that principle applies directly to security decision-making. Consider a scenario where the model outputs 95% attack probability. If the model is well-calibrated, an analyst can interpret this as a 1-in-20 chance of false alarm and decide accordingly. If the model is overconfident, that same 95% output might actually correspond to 70% true attack probability, fundamentally changing the decision calculus. The empirical verification of calibration quality presented in Section 3.3 provides assurance that probability outputs can be trusted for such reasoning. This assurance matters most in high-stakes situations where incorrect decisions carry significant consequences, precisely the situations where IoT botnet attacks cause the greatest damage [4,5].

4.4. Generalization and the Dataset Shift Problem

Perhaps the most important question this research raises concerns generalization. Sommer and Paxson [47] delivered a sobering critique of machine learning for intrusion detection, arguing that laboratory performance frequently fails to translate to production environments. Their analysis identified several contributing factors, including the difficulty of defining normality in real networks, the adversarial nature of the detection problem, and the tendency of models to learn artifacts specific to particular data collection setups. The experiments presented here cannot definitively refute these concerns, but they do provide encouraging evidence. The five-year gap between N-BaIoT (2018) and CICIoT2023 (2023) means the model trained on one dataset was evaluated against attacks implemented with different tools, targeting different device populations, and captured under different collection methodologies. That performance remains strong despite these differences suggests the learned representations encode something more durable than transient collection artifacts.
Quiñonero-Candela et al. [51] formalized the dataset shift problem and identified several distinct failure modes, including covariate shift where input distributions change, prior probability shift where class proportions change, and concept drift where the relationship between inputs and outputs evolves. The N-BaIoT to CICIoT2023 transition involves all three. The feature distributions differ because different extraction tools were used. The class proportions differ because different sampling strategies were employed. And the underlying attack-behavior relationships differ because botnet implementations evolved over the intervening five years. The robustness observed under these combined shifts is noteworthy, though it would be premature to claim the model would generalize equally well to arbitrary future attacks. Adversarial adaptation remains a persistent concern. Attackers who become aware of detection methods can deliberately modify their behavior to evade them [72,73], and the results here provide no guarantee of robustness against such targeted evasion.
The feature dimensionality reduction from 115 to 39 features raises questions about what information the model actually requires. If performance degrades by only 0.15 percentage points when two-thirds of the features are removed, it suggests that either the original feature set contained substantial redundancy or the model can compensate through its structural and temporal processing. LeCun, Bengio, and Hinton [60] observed that deep architectures excel at discovering useful representations from raw or minimally processed inputs, learning to extract relevant patterns that domain experts might not have anticipated. The GNN component may contribute particularly here by reconstructing relational information from patterns of feature similarity across samples, effectively inferring structure that explicit features would otherwise provide. This interpretation is speculative, and ablation studies targeting specific feature subsets would be needed to test it rigorously.

4.5. Limitations

Several limitations constrain the conclusions that can be drawn from this work. The evaluation focuses exclusively on binary classification, distinguishing attack traffic from benign traffic without identifying specific attack types. Operational deployment would often benefit from more granular classification that could inform targeted response strategies. A DDoS attack warrants different countermeasures than a credential-stuffing campaign, and the current architecture provides no basis for such differentiation. Extending to multi-class classification is technically straightforward but would require more extensive evaluation to verify that the strong binary performance translates to the more challenging multi-way discrimination task. Neto et al. [35] reported substantially lower accuracy for 33-class attack type classification on CICIoT2023, suggesting this extension is nontrivial.
The computational requirements present another practical constraint. Training the model on CICIoT2023 required over 50 h even with modern hardware, and inference latency was not optimized for real-time operation. For deployment in environments where detection speed is critical, such as protecting infrastructure from fast-propagating attacks like those documented by Antonakakis et al. [4], the current implementation may be too slow. Techniques for accelerating graph neural network inference exist [74,75], and applying them to this architecture represents a natural direction for future work. The tradeoff between detection quality and computational efficiency is fundamental, and different deployment contexts will favor different positions along this tradeoff curve.
The reliance on flow-level features rather than packet payloads limits what the model can detect. Encrypted traffic, which constitutes an increasing fraction of both benign and malicious communications [76,77], reveals nothing of its content to flow-based analysis. The model can only identify attacks that manifest through metadata patterns such as packet timing, size distributions, or communication graph structure. Attacks that carefully mimic benign metadata while hiding malicious payloads would evade detection entirely. This limitation is not unique to the proposed approach; it affects all network-level detection methods that respect encryption. Payload inspection would require either operating on unencrypted traffic or deploying endpoint agents that can observe decrypted content, both of which carry their own complications.
Finally, the evaluation used benchmark datasets collected under controlled conditions. Real production networks exhibit characteristics that these datasets may not fully capture, including more diverse device behaviors, more complex traffic patterns, and more sophisticated attacks designed with evasion in mind. Sommer and Paxson [47] emphasized that laboratory evaluation, while necessary, is not sufficient for validating intrusion detection systems. Production deployment would require extensive pilot testing and likely iterative refinement as failure modes emerge. The strong benchmark performance provides a foundation for such deployment but does not guarantee success in the messier reality of operational networks.
The experimental evaluation employed balanced sampling to ensure equal representation of benign and attack traffic. This choice makes comparison with prior benchmark-driven studies more straightforward, but it only partially reflects the conditions encountered in operational networks. In practice, malicious traffic usually constitutes a very small fraction of total volume, often well below one percent, and this imbalance fundamentally changes how detection performance is experienced in real deployments. Metrics obtained under controlled balance therefore describe discriminative behavior under laboratory conditions rather than deployment-level behavior under natural prevalence, where base-rate effects become dominant. While exploratory experiments under imbalanced conditions were conducted during development, they were not carried out in a systematic, deployment-oriented manner. For this reason, we do not claim that the reported results fully capture model behavior under natural traffic prevalence, and this remains an important limitation of the current study.
The temporal structure of the evaluation represents a related consideration. Samples were randomly shuffled and split for training and testing, rather than being separated using a strict time-based partition in which training data precedes test data. This simplifies controlled comparison across models and reduces variance across runs, but it does not fully capture the practical challenge of detecting attack behaviors that evolve over time. In operational settings, models are trained on past observations and must respond to future variants whose characteristics may shift gradually or abruptly. A dedicated time-split evaluation, in which temporal ordering is strictly enforced, would therefore be a necessary step toward assessing robustness under realistic deployment conditions and is left for future work.
A further limitation concerns the explicit assessment of device-level generalization. Although both datasets include traffic from multiple device types, samples from each device appear in both the training and test sets. As a result, the current evaluation does not isolate scenarios in which the model is assessed exclusively on previously unseen devices joining a network. At the same time, the model operates on traffic-level behavioral patterns rather than device-specific identifiers, and cross-dataset validation provides indirect evidence that the learned representations are not tied to individual devices. The extent to which these representations generalize to entirely unseen hardware within the same operational environment cannot, however, be directly quantified under the present evaluation setup. Incorporating device-level holdout protocols alongside time-aware validation and evaluation under natural traffic prevalence would allow a more targeted assessment of long-term deployment robustness.

4.6. Future Research Directions

The results suggest several promising directions for future investigation. Multi-class extension represents the most immediate opportunity. Rather than merely distinguishing attack from benign, a more capable system would identify specific attack categories and perhaps even individual attack variants. This finer granularity would enable more targeted response automation and could support threat intelligence correlation across detection events. The hierarchical structure of attack taxonomies, where specific attacks nest within broader categories, might lend itself to hierarchical classification architectures that first distinguish major attack families before drilling down to specific types. Such approaches have shown promise in other domains with structured label spaces [72].
Adversarial robustness deserves systematic investigation. Zügner et al. [74] demonstrated that graph neural networks can be vulnerable to carefully crafted perturbations that cause misclassification while remaining imperceptible to human observers. Carlini and Wagner [75] developed optimization-based attack methods that reliably fool deep classifiers, and Biggio and Roli [78] traced the decade-long evolution of adversarial machine learning research. Applying these attack methods to the proposed architecture would reveal vulnerabilities that informed adversaries might exploit. More importantly, it would motivate defensive modifications such as adversarial training or certified robustness techniques. Security is inherently adversarial, and detection systems that assume non-adaptive attackers invite eventual circumvention.
Transfer learning offers another avenue worth exploring. The strong cross-dataset performance observed here suggests the model learns transferable representations, but formal transfer learning methods [69,70] might enhance this capability further. Pre-training on large, diverse network traffic corpora before fine-tuning on specific deployment environments could improve performance in data-scarce settings and accelerate adaptation to new attack types. The recent success of foundation models in natural language processing and computer vision suggests that similar pretraining paradigms might benefit security applications, though the heterogeneity of network traffic poses unique challenges that language and image data do not present.
Graph sparsity represents another dimension worth exploring. The similarity threshold τ was fixed in the current experiments, but a sensitivity analysis examining how graph density affects detection performance could inform adaptive threshold selection for different deployment environments.
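A sensitivity analysis of this kind could start from a density measurement such as the sketch below, which assumes cosine similarity over standardized features; both choices are assumptions made for illustration rather than the exact construction used in our pipeline.

```python
import numpy as np

def edge_density(X: np.ndarray, tau: float) -> float:
    """Fraction of possible edges kept when connecting sample pairs
    whose cosine similarity exceeds the threshold tau."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -1.0)   # exclude self-loops
    n = X.shape[0]
    return float((sim > tau).sum()) / (n * (n - 1))

# Sweeping tau shows how quickly the graph sparsifies:
# for tau in (0.5, 0.7, 0.9, 0.95):
#     print(tau, edge_density(X, tau))
```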
Interpretability remains an underexplored dimension of graph-based intrusion detection. The GNN-LSTM architecture, like most deep learning systems, operates as a black box that provides predictions without explanations. Rudin [79] argued forcefully that high-stakes decisions should rely on interpretable models rather than opaque ones, and security certainly qualifies as high-stakes. Techniques for explaining GNN predictions exist [72], and adapting them to provide actionable insights about why particular traffic was flagged would increase analyst trust and support more effective response. When a model flags traffic as malicious with 95% probability, analysts would benefit from understanding which features or structural patterns drove that assessment, enabling them to verify the detection rather than blindly trusting algorithmic output.
Real-time operation represents a practical engineering challenge that complements the research directions discussed above. The current implementation prioritizes detection quality over speed, but production deployment often requires subsecond latency to enable timely response. Graph neural network inference can be parallelized effectively [73], and careful implementation might achieve the throughput necessary for inline deployment. Alternatively, the model might operate in a monitoring capacity where near-real-time detection is acceptable, with faster but less accurate methods providing first-line defense. The optimal architecture for any given deployment depends on factors including traffic volume, acceptable latency, and the consequences of missed or delayed detection, tradeoffs that merit explicit consideration in future work.

5. Conclusions

This work set out to address a fairly specific problem: detecting botnet attacks in IoT environments where the traffic patterns are heterogeneous, the device populations are diverse, and the attacks themselves keep evolving. The hybrid GNN-LSTM architecture proposed here appears to handle these challenges reasonably well. On N-BaIoT, the model achieved 99.88% accuracy with an AUROC of 0.9995, and on CICIoT2023 the results remained strong at 99.73% accuracy despite the dataset’s substantially greater complexity. What matters perhaps more than the raw numbers is that the model maintained this performance across two datasets collected five years apart using different methodologies and featuring different attack implementations. That kind of cross-dataset stability has historically been difficult to achieve in intrusion detection research [47], and though the results here do not definitively solve the generalization problem, they do suggest that combining graph-based and temporal representations offers a promising path forward.
The calibration analysis revealed something that often gets overlooked in detection research: the probability estimates themselves are reliable. A Brier score of 0.0015 on N-BaIoT means that when the model outputs 90% probability, it is right about 90% of the time. This property matters for operational deployment because security teams cannot respond to every alert with equal urgency. They need to triage, and calibrated probabilities provide a rational basis for doing so. The threshold flexibility demonstrated in the experiments means that different organizations can tune the precision–recall tradeoff to match their specific operational constraints without retraining the model.
The broader context for this research is the growing recognition that IoT security requires approaches tailored to the unique characteristics of these environments. Traditional intrusion detection methods designed for enterprise networks do not translate directly to settings where thousands of constrained devices generate heterogeneous traffic patterns. The graph-based representation explored here offers one way to capture the relational structure inherent in networked device behavior, while the LSTM component models the temporal dynamics that distinguish coordinated attacks from benign activity. Whether this particular architectural combination proves optimal is less important than the general principle it illustrates: effective IoT security likely requires hybrid approaches that integrate multiple analytical perspectives rather than relying on any single paradigm.

Author Contributions

Conceptualization, T.B. and Y.B.; methodology, T.B. and K.K.; software, T.B. and A.S.; validation, Y.B., D.Y., O.K. and K.S.; formal analysis, T.B., K.K. and A.S.; investigation, K.K., D.Y., O.K. and K.S.; resources, Y.B. and K.S.; data curation, A.S.; writing—original draft preparation, T.B. and K.K.; writing—review and editing, K.K., Y.B., D.Y. and K.S.; visualization, K.S., O.K. and D.Y.; supervision, Y.B. and D.Y.; project administration, Y.B. and D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan under Grant No. AP26104787, “Development of solutions for the protection of IoT infrastructure of smart cities in Kazakhstan based on AI”.

Data Availability Statement

N-BaIoT: https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT (accessed on 28 December 2025); CICIoT2023: https://www.unb.ca/cic/datasets/index.html (accessed on 28 December 2025).

Acknowledgments

The authors would like to thank the International IT University for providing computational resources and administrative support for this research project.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AUC: Area Under the Curve
FN: False Negative
FP: False Positive
GNN: Graph Neural Network
IDS: Intrusion Detection System
IoT: Internet of Things
LSTM: Long Short-Term Memory
MCC: Matthews Correlation Coefficient
ML: Machine Learning
PR: Precision–Recall
ROC: Receiver Operating Characteristic
TN: True Negative
TP: True Positive
TPR: True Positive Rate

References

  1. Weber, R.H. Internet of Things—New Security and Privacy Challenges. Comput. Law Secur. Rev. 2010, 26, 23–30. [Google Scholar] [CrossRef]
  2. Alaba, F.A.; Othman, M.; Hashem, I.A.T.; Alotaibi, F. Internet of Things security: A survey. J. Netw. Comput. Appl. 2017, 88, 10–28. [Google Scholar] [CrossRef]
  3. Bertino, E.; Islam, N. Botnets and Internet of Things Security. Computer 2017, 50, 76–79. [Google Scholar] [CrossRef]
  4. Antonakakis, M.; April, T.; Bailey, M.; Bernhard, M.; Bursztein, E.; Cochran, J.; Durumeric, Z.; Halderman, J.A.; Invernizzi, L.; Kallitsis, M.; et al. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium, Vancouver, BC, Canada, 16–18 August 2017; pp. 1093–1110. Available online: https://www.semanticscholar.org/paper/220a7eed5c859f596a0d9dbc194034d170a6af51 (accessed on 28 December 2025).
  5. Kolias, C.; Kambourakis, G.; Stavrou, A.; Voas, J. DDoS in the IoT: Mirai and Other Botnets. Computer 2017, 50, 80–84. [Google Scholar] [CrossRef]
  6. Miettinen, M.; Marchal, S.; Hafeez, I.; Asokan, N.; Sadeghi, A.R.; Tarkoma, S. IoT SENTINEL: Automated Device-Type Identification for Security Enforcement in IoT. In Proceedings of the IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 2177–2184. [Google Scholar] [CrossRef]
  7. Roesch, M. Snort: Lightweight Intrusion Detection for Networks. In Proceedings of the 13th USENIX Conference on System Administration (LISA), Seattle, WA, USA, 7–12 November 1999; pp. 229–238. [Google Scholar]
  8. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41, 1–58. [Google Scholar] [CrossRef]
  9. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176. [Google Scholar] [CrossRef]
  10. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of Intrusion Detection Systems: Techniques, Datasets and Challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
  11. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; Available online: https://arxiv.org/abs/1609.02907 (accessed on 28 December 2025).
  12. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1025–1035. Available online: https://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs (accessed on 28 December 2025).
  13. Zhang, Y.; Chen, Y.; Wang, J. Deep Graph Embedding for IoT Botnet Traffic Detection. Secur. Commun. Netw. 2023, 2023, 9796912. [Google Scholar] [CrossRef]
  14. Bibi, I.; Özçelebi, T.; Meratnia, N. An IoT Attack Detection Framework Leveraging Graph Neural Networks. In Intelligence of Things: Technologies and Applications; Springer: Cham, Switzerland, 2023; pp. 225–236. [Google Scholar] [CrossRef]
  15. Saad, A.M.S.E. Leveraging Graph Neural Networks for Botnet Detection. In Advanced Engineering, Technology and Applications; Springer: Cham, Switzerland, 2024; pp. 145–158. [Google Scholar] [CrossRef]
  16. Zhou, J.; Xu, Z.; Rush, A.M.; Yu, M. Efficient Large-Scale IoT Botnet Detection through GraphSAINT-Based Subgraph Sampling and Graph Isomorphism Network. Mathematics 2024, 12, 1315. [Google Scholar] [CrossRef]
  17. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  18. Kim, J.; Kim, J.; Thu, H.L.T.; Kim, H. Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection. In Proceedings of the International Conference on Platform Technology and Service (PlatCon), Jeju, Republic of Korea, 15–17 February 2016; pp. 1–5. [Google Scholar] [CrossRef]
  19. Alkahtani, H.; Aldhyani, T.H.H. An Intrusion Detection System to Advance Internet of Things Infrastructure Based on Deep Learning Algorithms. Complexity 2021, 2021, 5579851. [Google Scholar] [CrossRef]
  20. Sinha, P.; Sahu, D.; Prakash, S.; Yang, T.; Rathore, R.S.; Pandey, V.K. A High Performance Hybrid LSTM-CNN Secure Architecture for IoT Environments Using Deep Learning. Sci. Rep. 2025, 15, 9684. [Google Scholar] [CrossRef] [PubMed]
  21. Sayegh, H.R.; Dong, W.; Al-madani, A.M. Enhanced Intrusion Detection with LSTM-Based Model, Feature Selection, and SMOTE for Imbalanced Data. Appl. Sci. 2024, 14, 479. [Google Scholar] [CrossRef]
  22. Esmaeili, B.; Azmoodeh, A.; Dehghantanha, A.; Srivastava, G.; Karimipour, H.; Lin, J.C.W. A GNN-Based Adversarial Internet of Things Malware Detection Framework for Critical Infrastructure: Studying Gafgyt, Mirai and Tsunami Campaigns. IEEE Internet Things J. 2024, 11, 8468–8479. [Google Scholar] [CrossRef]
  23. Duan, G.; Lv, H.; Wang, H.; Feng, G. Application of a Dynamic Line Graph Neural Network for Intrusion Detection with Semi-Supervised Learning. IEEE Trans. Inf. Forensics Secur. 2023, 18, 699–714. [Google Scholar] [CrossRef]
  24. Vitulyova, Y.; Babenko, T.; Kolesnikova, K.; Kiktev, N.; Abramkina, O. A Hybrid Approach Using Graph Neural Networks and LSTM for Attack Vector Reconstruction. Computers 2025, 14, 301. [Google Scholar] [CrossRef]
  25. Friji, H.; Mavromatis, I.; Sanchez-Mompo, A.; Carnelli, P.; Olivereau, A.; Khan, A. Multi-Stage Attack Detection and Prediction Using Graph Neural Networks: An IoT Feasibility Study. In Proceedings of the IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, UK, 1–3 November 2023; pp. 1584–1591. [Google Scholar] [CrossRef]
  26. Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-BaIoT: Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22. [Google Scholar] [CrossRef]
  27. Al-Akhras, M.; Alshunaybir, A.; Omar, H.; Alhazmi, S. Botnet attacks detection in IoT environment using machine learning techniques. Int. J. Data Netw. Sci. 2023, 7, 1683–1706. [Google Scholar] [CrossRef]
  28. Nour, M.; Atya, A.O.; Ghali, N.I.; El-Gazar, S.M. Intelligent Detection of IoT Botnets Using Machine Learning and Deep Learning. Appl. Sci. 2020, 10, 7009. [Google Scholar] [CrossRef]
  29. Al Shorman, A.; Faris, H.; Aljarah, I. Unsupervised Intelligent System Based on One Class Support Vector Machine and Grey Wolf Optimization for IoT Botnet Detection. J. Ambient Intell. Humaniz. Comput. 2020, 11, 2809–2825. [Google Scholar] [CrossRef]
  30. Kasongo, S.M.; Sun, Y. Performance Analysis of Intrusion Detection Systems Using a Feature Selection Method on the UNSW-NB15 Dataset. J. Big Data 2020, 7, 105. [Google Scholar] [CrossRef]
  31. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1321–1330. Available online: https://dl.acm.org/doi/10.5555/3305381.3305518 (accessed on 28 December 2025).
  32. Brier, G.W. Verification of Forecasts Expressed in Terms of Probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
  33. Babenko, T.; Kolesnikova, K.; Abramkina, O.; Vitulyova, Y. Automated OSINT Techniques for Digital Asset Discovery and Cyber Risk Assessment. Computers 2025, 14, 430. [Google Scholar] [CrossRef]
  34. Olekh, H.; Kolesnikova, K.; Olekh, T.; Mezenceva, O. Environmental impact assessment procedure as the implementation of the value approach in environmental projects. In CEUR Workshop Proceedings; CEUR-WS.org: Aachen, Germany, 2021; Volume 2870, pp. 1–10. Available online: https://ceur-ws.org/Vol-2870/ (accessed on 28 December 2025).
  35. Neto, E.C.P.; Dadkhah, S.; Ferber, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941. [Google Scholar] [CrossRef]
  36. Dadkhah, S.; Mahdikhani, H.; Danso, P.K.; Ghorbani, A.A. Towards the Development of a Realistic Multidimensional IoT Profiling Dataset. In Proceedings of the 19th International Conference on Privacy, Security and Trust (PST), Fredericton, NB, Canada, 22–24 August 2022; pp. 1–11. [Google Scholar] [CrossRef]
  37. Babenko, T.; Toliupa, S.; Kovalova, Y. LVQ models of DDOS attacks identification. In Proceedings of the 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, 20–24 February 2018; pp. 1–4. [Google Scholar] [CrossRef]
  38. Hnatiienko, H.; Hnatiienko, V.; Zulunov, R.; Babenko, T.; Myrutenko, L. Method for determining the level of criticality elements when ensuring the functional stability of the system based on role analysis of elements. In CEUR Workshop Proceedings; CEUR-WS.org: Aachen, Germany, 2024; Available online: https://ceur-ws.org/ (accessed on 28 December 2025).
Figure 1. Overall framework architecture showing three-stage processing pipeline.
Figure 2. Class distribution across devices in the N-BaIoT dataset showing severe imbalance between benign and attack samples.
Figure 3. Attack category distribution in CICIoT2023 dataset showing DDoS attacks as the dominant category.
Figure 4. Feature space comparison between N-BaIoT and CICIoT2023 datasets showing dimensional differences and feature type overlap.
Figure 5. Graph construction in the proposed framework. Top: Temporal evolution of graph connectivity during botnet activity, illustrating increasing edge density from benign behavior (t = 0) through infection (t = 10) to propagation (t = 20). Bottom: Graph construction pipeline. Raw network flows are transformed into statistical feature vectors, which are treated as nodes. Pairwise similarity is computed in feature space, and adaptive thresholding is applied to construct the adjacency matrix used for GNN processing.
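The graph-construction pipeline in Figure 5 reduces to three operations: normalize the flow feature vectors, compute pairwise similarity, and threshold adaptively. A minimal sketch is given below; cosine similarity and a percentile-based threshold are our assumptions for illustration and may differ from the exact choices used in the implementation.

```python
# Minimal sketch of the Figure 5 graph-construction step. Cosine similarity
# and the percentile-based "adaptive" threshold are assumptions, not
# necessarily the paper's exact choices.
import numpy as np

def build_adjacency(X: np.ndarray, percentile: float = 90.0) -> np.ndarray:
    """Treat each flow's feature vector as a node; connect pairs whose
    cosine similarity exceeds a data-dependent (percentile) threshold."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                          # pairwise cosine similarity
    np.fill_diagonal(S, 0.0)               # no self-loops
    tau = np.percentile(S, percentile)     # adaptive threshold
    return (S > tau).astype(np.float32)    # binary adjacency matrix

# Example: 6 flows described by 115 statistical features (as in N-BaIoT).
A = build_adjacency(np.random.rand(6, 115))
print(A.shape, int(A.sum() // 2), "edges")
```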
Figure 6. Confusion matrix of the N-BaIoT test set displaying the classification distribution across benign (TN = 74,851, FP = 138) and attack (FN = 48, TP = 74,941) classes.
Figure 7. Receiver operating characteristic curve for N-BaIoT classification demonstrating AUROC = 0.9995.
Figure 8. Precision–recall curve for N-BaIoT with AUCPR = 0.9993.
Figure 9. Mean confusion matrix across five-fold cross-validation on CICIoT2023 showing per-fold averages (TN = 39,985, FP = 3, FN = 216, TP = 39,772).
Figure 10. Box plot visualization of metric distributions across five-fold cross-validation on CICIoT2023. Different colors correspond to different evaluation metrics shown on the x-axis.
Figure 11. ROC curves for all five folds on CICIoT2023, demonstrating consistent discrimination with AUROC range 0.9981–0.9985.
Figure 12. Precision–recall curves across five-fold cross-validation on CICIoT2023 with AUCPR range 0.9988–0.9990.
Figure 13. Calibration curve for N-BaIoT showing predicted probability versus observed frequency with Brier score = 0.0015.
Figure 14. Calibration curve for CICIoT2023 averaged across five-fold cross-validation.
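Reliability curves such as those in Figures 13 and 14 can be reproduced with standard tooling. The sketch below uses scikit-learn's calibration_curve on synthetic scores; the 10-bin setting and the placeholder data are assumptions, not the paper's evaluation pipeline.

```python
# Minimal sketch of a reliability (calibration) curve as in Figures 13-14.
# Synthetic labels/scores and the 10-bin setting are assumptions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 5000)                                 # placeholder labels
y_prob = np.clip(0.85 * y_true + 0.15 * rng.random(5000), 0, 1)   # placeholder scores

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(f"Brier score: {brier_score_loss(y_true, y_prob):.4f}")
for mp, fp in zip(mean_pred, frac_pos):       # points on the diagonal = perfect calibration
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```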
Figure 15. Probability distribution histograms for benign and attack classes on N-BaIoT.
Figure 16. Probability distribution histograms for CICIoT2023 across cross-validation folds.
Figure 17. Threshold sweep analysis for N-BaIoT showing precision, recall, and F1-score as functions of classification threshold. The red dot indicates the selected operating threshold corresponding to the optimal F1-score.
Figure 18. Threshold sweep analysis for CICIoT2023 averaged across cross-validation folds. The red dot indicates the selected operating threshold corresponding to the optimal F1-score.
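The threshold sweeps in Figures 17 and 18 amount to evaluating precision, recall, and F1 over a grid of cutoffs and keeping the F1 maximum (the red dot). A minimal sketch follows; the grid resolution and placeholder data are assumptions.

```python
# Minimal sketch of the threshold sweep behind Figures 17-18: evaluate
# precision/recall/F1 on a grid of cutoffs and keep the F1-optimal one.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def sweep_thresholds(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    best_t, best_f1 = 0.5, 0.0
    for t in grid:
        y_hat = (y_prob >= t).astype(int)
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_hat, average="binary", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1          # the operating point marked in the figures

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 2000)
y_prob = np.clip(0.8 * y_true + 0.3 * rng.random(2000), 0, 1)
print("F1-optimal threshold, F1:", sweep_thresholds(y_true, y_prob))
```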
Figure 19. Training history for N-BaIoT showing loss and accuracy evolution across epochs with early stopping at epoch 17.
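The early stop at epoch 17 in Figure 19 follows the usual patience rule: training halts once validation loss stops improving for a fixed number of epochs. The sketch below illustrates the mechanism on a simulated loss curve; the patience value (5) and the curve itself are assumptions chosen only so the stop lands at epoch 17 for illustration.

```python
# Minimal sketch of patience-based early stopping (cf. Figure 19).
# The simulated validation-loss curve and patience = 5 are assumptions.
import numpy as np

rng = np.random.default_rng(0)
val_losses = np.concatenate([
    np.linspace(0.5, 0.05, 12),        # improving phase
    0.05 + 0.02 * rng.random(88),      # plateau: no further improvement
])

best, patience, bad = float("inf"), 5, 0
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best - 1e-4:             # meaningful improvement
        best, bad = loss, 0            # reset the patience counter
    else:
        bad += 1
        if bad >= patience:
            print(f"early stopping at epoch {epoch}")  # epoch 17 here
            break
```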
Table 1. N-BaIoT dataset device characteristics and traffic distribution.

Device Type | Model | Benign Samples | Attack Samples | Traffic Pattern
Doorbell | Danmini | 49,548 | 1,842,674 | Event-driven bursts
Baby monitor | Philips B120N/10 | 175,240 | 1,925,150 | Continuous low-latency stream
Security camera | Provision PT-737E | 62,154 | 869,306 | Periodic heartbeat, motion-triggered
Security camera | Provision PT-838 | 98,514 | 932,446 | HD streaming, bandwidth-intensive
Webcam | SimpleHome XCS7-1002 | 46,585 | 524,656 | Variable bitrate, adaptive quality
Webcam | SimpleHome XCS7-1003 | 19,528 | 359,872 | Fixed intervals, predictable pattern
Thermostat | Ecobee | 13,113 | 806,886 | Sparse status updates
Socket | Edimax SP-2101W | 8,135 | 385,949 | Command-response, binary states
Motion sensor | Samsung SNH-1011N | 52,150 | 615,028 | Event detection, alert propagation
Table 2. CICIoT2023 dataset attack categories and sample distribution.

Attack Category | Attack Types Included | Sample Count | Percentage
DDoS | ACK Fragmentation, UDP Flood, SYN Flood, ICMP Flood, HTTP Flood, SlowLoris, TCP Flood, PSHACK Flood, RSTFIN Flood, UDP Fragmentation, ICMP Fragmentation, Synonymous IP Flood | 28,534,126 | 60.7%
DoS | TCP Flood, UDP Flood, HTTP Flood | 6,823,417 | 14.5%
Mirai | Greip Flood, Greeth Flood, UDPPlain | 4,912,553 | 10.4%
Reconnaissance | Host Discovery, Port Scanning, OS Fingerprinting, Vulnerability Scanning | 3,245,891 | 6.9%
Spoofing | ARP Spoofing, DNS Spoofing | 1,823,445 | 3.9%
Web-based | SQL Injection, Command Injection, XSS, Browser Hijacking, Backdoor Malware | 1,156,234 | 2.5%
Brute Force | Dictionary Attack | 478,923 | 1.0%
Total attack | — | 46,974,589 | 100%
Benign | Normal traffic from 105 devices | 1,035,721 | —
Table 3. Experimental dataset configurations.

Parameter | N-BaIoT | CICIoT2023
Total samples used | 1,000,000 | 400,000
Samples per class | 500,000 | 200,000
Number of features | 115 | 39
Number of devices | 9 | 105
Number of attack types | 10 | 33
Collection year | 2018 | 2023
Training samples | 700,000 | 272,000 per fold
Validation samples | 150,000 | 48,000 per fold
Test samples | 150,000 | 80,000 per fold
Evaluation method | Single stratified split | 5-fold stratified CV
Sequence length (T) | 24 | 24
Batch size | 128 | 128
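Two rows of Table 3, the sequence length T = 24 and the batch size 128, determine how flows reach the temporal branch. One possible windowing is sketched below; the non-overlapping windows, last-flow labeling, and random tensors are assumptions rather than the paper's exact preprocessing.

```python
# Minimal sketch of grouping flows into sequences of T = 24 and batches of
# 128 (Table 3). Non-overlapping windows and last-flow labels are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

T, BATCH, N_FEATURES = 24, 128, 115          # N-BaIoT uses 115 features
X = torch.randn(10_000, N_FEATURES)          # placeholder flow features
y = torch.randint(0, 2, (10_000,))           # placeholder labels

n_seq = X.shape[0] // T                      # keep whole windows only
seqs = X[: n_seq * T].reshape(n_seq, T, N_FEATURES)
labels = y[: n_seq * T].reshape(n_seq, T)[:, -1]   # label = last flow in window

loader = DataLoader(TensorDataset(seqs, labels), batch_size=BATCH, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape)                              # torch.Size([128, 24, 115])
```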
Table 4. Hybrid model architecture and hyperparameters.

Component | Parameter | Value
GNN Layer 1 | Input/Output Dimensions | 115 → 64
GNN Layer 1 | Activation Function | ReLU + Batch Norm
GNN Layer 2 | Input/Output Dimensions | 64 → 64
GNN Layer 2 | Dropout Rate | 0.4
Graph Pooling | Aggregation Method | Global Mean Pool
LSTM | Hidden Units | 64
LSTM | Number of Layers | 1
LSTM | Sequence Length | 24
Output Layer | Architecture | Linear (64 → 2)
Training | Optimizer | Adam (lr = 0.0001, weight_decay = 0.01)
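Table 4 pins down the layer dimensions, and a skeleton consistent with it is sketched below in PyTorch. The choice of GCNConv from PyTorch Geometric and the hand-off from pooled graph embeddings to the LSTM are our assumptions; only the sizes, dropout rate, and optimizer settings come from the table.

```python
# Skeleton consistent with Table 4. GCNConv and the graph-to-sequence wiring
# are assumptions; dimensions, dropout, and optimizer settings follow the table.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class HybridGNNLSTM(nn.Module):
    def __init__(self, in_dim=115, hidden=64, n_classes=2, seq_len=24):
        super().__init__()
        self.seq_len = seq_len
        self.conv1 = GCNConv(in_dim, hidden)       # GNN layer 1: 115 -> 64
        self.bn1 = nn.BatchNorm1d(hidden)          # ReLU + Batch Norm
        self.conv2 = GCNConv(hidden, hidden)       # GNN layer 2: 64 -> 64
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)    # Linear 64 -> 2

    def forward(self, x, edge_index, batch):
        h = F.relu(self.bn1(self.conv1(x, edge_index)))
        h = F.relu(self.conv2(h, edge_index))
        h = F.dropout(h, p=0.4, training=self.training)
        g = global_mean_pool(h, batch)             # one embedding per graph snapshot
        g = g.view(-1, self.seq_len, g.size(-1))   # assumes whole T-length sequences
        _, (hn, _) = self.lstm(g)                  # temporal aggregation over T = 24
        return self.out(hn[-1])                    # class logits

model = HybridGNNLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
```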
Table 5. Risk score calibration performance metrics.

Metric | Before Calibration | After Calibration | Improvement
Expected Calibration Error | 0.038 | 0.012 | 68.4%
Brier Score | 0.0032 | 0.0015 | 53.1%
Log Loss | 0.0120 | 0.0058 | 51.7%
AUC-ROC | 0.9990 | 0.9995 | 0.05%
Mean Confidence Error | 0.042 | 0.018 | 57.1%
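Of the Table 5 metrics, the Expected Calibration Error is the least standard, so a reference computation is sketched below with 10 equal-width bins; the bin count is an assumption, and the post-hoc calibrator that produced the "After" column is not reproduced here.

```python
# Minimal sketch of Expected Calibration Error (ECE) and the Brier score
# from Table 5; the 10 equal-width bins are an assumption.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            conf = y_prob[mask].mean()             # mean predicted probability
            acc = y_true[mask].mean()              # observed positive frequency
            ece += mask.mean() * abs(acc - conf)   # |gap| weighted by bin occupancy
    return ece

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 5000).astype(float)
y_prob = np.clip(0.9 * y_true + 0.1 * rng.random(5000), 0, 1)
print(f"ECE:   {expected_calibration_error(y_true, y_prob):.4f}")
print(f"Brier: {np.mean((y_prob - y_true) ** 2):.4f}")
```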
Table 6. Classification performance metrics on N-BaIoT test set.

Metric | Value | Metric | Value
Accuracy | 99.88% | Specificity | 99.82%
Precision | 99.82% | False Positive Rate | 0.18%
Recall | 99.94% | False Negative Rate | 0.06%
F1-Score | 0.9988 | MCC | 0.9975
AUROC | 0.9995 | Cohen’s Kappa | 0.9975
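As a consistency check, every entry in Table 6 except MCC, Cohen's kappa, and AUROC follows directly from the Figure 6 confusion matrix; the arithmetic below is a worked example rather than the paper's evaluation code.

```python
# Worked example: deriving the Table 6 metrics from the Figure 6 confusion
# matrix (TN = 74,851, FP = 138, FN = 48, TP = 74,941).
tn, fp, fn, tp = 74_851, 138, 48, 74_941

accuracy    = (tp + tn) / (tp + tn + fp + fn)               # 0.9988 -> 99.88%
precision   = tp / (tp + fp)                                # 0.9982 -> 99.82%
recall      = tp / (tp + fn)                                # 0.9994 -> 99.94%
specificity = tn / (tn + fp)                                # 0.9982 -> 99.82%
fpr         = fp / (fp + tn)                                # 0.0018 -> 0.18%
fnr         = fn / (fn + tp)                                # 0.0006 -> 0.06%
f1          = 2 * precision * recall / (precision + recall) # 0.9988

print(f"acc={accuracy:.4f} prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```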
Table 7. Five-fold cross-validation results on CICIoT2023 (mean ± standard deviation).

Metric | Mean ± Std | 95% CI | Range
Accuracy | 99.73% ± 0.02% | 99.71–99.75% | 99.70–99.75%
Precision | 99.99% ± 0.01% | 99.98–100.00% | 99.98–100.00%
Recall | 99.46% ± 0.04% | 99.42–99.50% | 99.39–99.51%
Specificity | 99.99% ± 0.01% | 99.98–100.00% | 99.97–100.00%
F1-Score | 0.9972 ± 0.0002 | 0.9970–0.9974 | 0.9970–0.9975
AUROC | 0.9984 ± 0.0002 | 0.9982–0.9986 | 0.9981–0.9985
MCC | 0.9945 ± 0.0004 | 0.9941–0.9949 | 0.9939–0.9950
Brier Score | 0.0030 ± 0.0002 | 0.0028–0.0032 | 0.0027–0.0032
Table 8. Per-fold performance breakdown for CICIoT2023 cross-validation.

Fold | Accuracy | Precision | Recall | F1 | AUROC | Brier
1 | 99.71% | 99.98% | 99.43% | 0.9971 | 0.9985 | 0.0031
2 | 99.75% | 99.99% | 99.51% | 0.9975 | 0.9985 | 0.0027
3 | 99.74% | 99.99% | 99.49% | 0.9974 | 0.9984 | 0.0029
4 | 99.70% | 100.00% | 99.39% | 0.9970 | 0.9981 | 0.0032
5 | 99.73% | 99.99% | 99.47% | 0.9973 | 0.9985 | 0.0029
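The Table 7 summaries follow from the per-fold values in Table 8. The sketch below reproduces the accuracy row with a Student's-t confidence interval; the exact CI construction used in the paper is not stated, so small differences from rounding or from a normal-approximation interval are expected.

```python
# Minimal sketch: aggregating the Table 8 accuracy column into the
# mean +/- std and 95% CI reported in Table 7. The t-based interval is an
# assumption; rounding can shift the bounds by ~0.01 percentage points.
import numpy as np
from scipy import stats

acc = np.array([99.71, 99.75, 99.74, 99.70, 99.73])   # Table 8, per-fold accuracy
mean, sd = acc.mean(), acc.std(ddof=1)                 # 99.73 +/- 0.02
half = stats.t.ppf(0.975, df=len(acc) - 1) * sd / np.sqrt(len(acc))
print(f"{mean:.2f}% ± {sd:.2f}%, 95% CI [{mean - half:.2f}%, {mean + half:.2f}%]")
```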
Table 9. Probabilistic performance metrics comparison between datasets.

Metric | N-BaIoT | CICIoT2023
Brier Score | 0.0015 | 0.0030 ± 0.0002
AUROC | 0.9995 | 0.9984 ± 0.0002
AUCPR | 0.9993 | 0.9990 ± 0.0001
Expected Calibration Error | 0.012 | 0.018 ± 0.003
Log Loss | 0.0058 | 0.0112 ± 0.0008
Table 10. Cross-dataset performance comparison.

Metric | N-BaIoT | CICIoT2023 | Δ
Accuracy | 99.88% | 99.73% | −0.15%
Precision | 99.82% | 99.99% | +0.17%
Recall | 99.94% | 99.46% | −0.48%
F1-Score | 0.9988 | 0.9972 | −0.0016
AUROC | 0.9995 | 0.9984 | −0.0011
Brier Score | 0.0015 | 0.0030 | +0.0015
MCC | 0.9975 | 0.9945 | −0.0030
Features | 115 | 39 | −66%
Device count | 9 | 105 | +1067%
Attack types | 10 | 33 | +230%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
